Analytical Systems architecture and design patterns – Part 1

This series of blog posts will discuss system architecture and design principles to build an analytical data system with maintainability, scalability and longevity. Further on, discuss the details in the design principles to give the analytical data system agility, reusability and robustness.

These principles have been learned and explored the hard way, through empirical experiments (building analytical data systems) and observing the results over the last 40 years of what works and what does not work. To understand why the principles of architecture and design of an analytical data system are what they are, we need to go back in history before we knew what we know today.

The Copy System

In the 70’s and 80’s the data systems got more and more common. They were built to support different operational processes in the business. Even at this time the business understood that in all that data that was gathered in these operational systems there was knowledge to be used to support decision making. So, they wanted to use that data to understand how the company was doing. One issue was that the operational systems where fully occupied to support the business process and any analytical query against it would ruin its operational performance.

The solution was to offload the operational system on to another platform. Just copy the tables/files every month or so to the other platform and someone could start running queries against the data without disturbing the operational system.

Once the knowledge was extracted out to of an operational systems data, more questions from the business started to appear and these questions could not always be answered by data from one operational system, so we needed to off load data from more operational systems had, this was done into the same platform as the first system, just by copying each operational systems tables/files. So the Copy System appeared, the first analytical system.

Figure 1: The first Analytical System – The Copy System

Now there was a system that had data copied from multiple operational systems. The data was stored according to the operational systems data models. The data was raw, and the extraction and load of the data was data driven.

Once the data had landed on the new platform, smart people started to work with the data. These people had to have,

  • good knowledge of the Business,
  • good at data and data structures
  • good at programming

In the 70’s and early 80’s, programming against the data structures was often done in PL/1, Cobol, RPG etc. SQL became more popular later. Complex programs were written to extract, transform and merge data into reports for higher management to use in their business decisions. These programs were built to produce a very specific result/report. They were hard to maintain and often hard to expand in functionality, there were no ad hoc query possibility, since the programs where designed for a specific purpose. Another problem was that when more than one person worked on creating the same report, the result the different report builders came up with differed, which made it hard for the users of the reports to decided which report to believe.

So, we had following problems

  • Only people with high cross functional skills (Business, Data, Programming) could query the data
  • Different result on the same data, depending on who wrote the program
  • Programs hard to maintain and change
  • No ad hoc capability on the programs
  • Only aggregated data was accessible for user

Since the Copy Systems were very much an afterthought in the IT landscape and there were no good architectural design principles for how to build and maintain these systems, they often had big problems of handling change of requirements for queries and problem when the underlying systems, who’s data that was copied, changed. These problems would escalate as more systems was copied and more query requirements where implemented.

The Many to Many Challenge

What is the fundamental challenges of building these analytical systems then? It is that there are two sides in the system (Ingestion and Access) and they have a many to many relationships between them.

  • Ingestion Area must handle
    • Multiple information areas
    • Multiple IT systems
    • Multiple Data models
    • Multiple local semantics
    • Multiple local identifications
  • Access area must handle
    • Multiple user groups
    • Multiple analytical/report requirements/use cases
    • Multiple granularities

Figure 2: Analytical System Architecture – Many to Many dependencies

This is a major difference between building an operational system and an analytical system. An operational system supports a specific business process. These processes have clear steps on what they want to achieve. The input of data in clearly defined and the output is well defined. The number of use cases/scenario the system should support Is limited and the user group of the system is well defined and limited.  On the other hand, an Analytical system has many different data sources and a lot of different user groups with a lot of different requirements to support multiple different processes.

This means that the analytical system does not know all the different requirements/use cases it will have to support in its life cycle. So how do you then design a system that does not know what requirements it is supposed to support?

In the early days (70’s, 80’s, 90’s) the architecture looked like this, or the lack of architectural design patterns made it look like this.

Figure 3: Early Analytical System Architecture and design

As we can see it is a mess of spaghetti lines with multiple dependencies on data and usage of data. Any time a new requirement/use case entered the analytical system, new lines of dependencies between users and data sources was created. This was of course not maintainable or scalable over time as new requirements and systems came in.

So how do we handle these many to many relationships between Ingestion and access? We look how it is done in a data model! When running into a many to many relationships in 3NF modeling, we create an associative component to handle the relationships. That is exactly what needs to be done in the analytical system architecture.

The Associative Component

The associative component (see figure 4) must have a generic purpose for supporting reuse of data in different analytical components and is not built for one specific requirement/use case.

Now the definition of that associative entity is the important part on how to build an analytical system. Bill Inmon gave us that definition in the early 90’s and it still stands.

“a subject oriented, integrated, nonvolatile, time variant collection of data in support of management’s decisions”

Figure 4: Analytical System Architecture – the associative component

In the next part of this blog series I will describe my view on what the definition means and why it is so important if you want to build an Analytical System that will be maintainable and scalable over a long-time frame.