Analytical Systems architecture and design patterns – Part 1
Data Warehouse – People talk about it as if it were something of the past, an experiment that went awfully wrong and is now better forgotten. New, better ways of using data have come along. Long live Hadoop, Data Lake and Big Data analytics!
But wait, Hadoop is a hardware and software ecosystem for parallel processing of data, a Data Lake is a storage area for data on an HDFS file system, and Big Data analytics is very much about using statistical models and formulas to extract knowledge from data stored in a data lake. Those things can’t replace the Data Warehouse paradigm; they are not the same thing! It’s like replacing a road with a car.
For me, “Data Warehouse” is a term for the system architecture and design principles used to build an analytical system with maintainability, scalability and longevity. Furthermore, the principles give the analytical system agility, reusability and robustness.
These principles have been learned and explored the hard way, through empirical experiments (building analytical systems) and 40 years of observing what works and what does not. To understand why the principles of architecture and design of an analytical system are what they are, we need to go back in history, before we knew what we know today.
The Copy System
In the 70’s and 80’s, data systems became more and more common. They were built to support different operational processes in the business. Even at this time, the business understood that in all the data gathered in these operational systems there was knowledge that could be used to support decision making. So, they wanted to use that data to understand how the company was doing. One issue was that the operational systems were fully occupied supporting the business process, and any analytical query against them would ruin their operational performance.
The solution was to offload the operational system onto another platform. Just copy the tables/files every month or so to the other platform, and someone could start running queries against the data without disturbing the operational system.
Once the knowledge was extracted out of one operational system’s data, more questions from the business started to appear, and these questions could not always be answered by data from a single operational system. So we needed to offload data from more operational systems, onto the same platform as the first system, simply by copying each operational system’s tables/files. So the Copy System came to be: the first analytical system.
Figure 1: The first Analytical System – The Copy System
Now there was a system that had data copied from multiple operational systems. The data was stored according to the operational systems’ data models. The data was raw, and the extraction and load of the data was data driven. That sounds similar to a Data Lake, right?
Once the data had landed on the new platform, smart people started to work with the data. These people had to have
- good knowledge of the business,
- good skills with data and data structures,
- good programming skills.
Sounds a lot like another “hot” analytical job description today, right?
In the 70’s and early 80’s, programming against the data structures was often done in PL/I, COBOL, RPG etc.; SQL became more popular later. Complex programs were written to extract, transform and merge data into reports for higher management to use in their business decisions. These programs were built to produce a very specific result/report. They were hard to maintain and often hard to expand in functionality, and there was no ad hoc query possibility, since the programs were designed for a specific purpose. Another problem was that when more than one person worked on creating the same report, the results the different report builders came up with differed, which made it hard for the users of the reports to decide which report to believe.
So, we had the following problems:
- Only people with high cross functional skills (Business, Data, Programming) could query the data
- Different results on the same data, depending on who wrote the program
- Programs hard to maintain and change
- No ad hoc capability on the programs
- Only aggregated data was accessible to users
At this point, in the mid to late 80’s, relational databases started to mature and SQL became more popular; the first ANSI-standardized version of SQL came along in 1986.
In the late 80’s and the beginning of the 90’s, innovative ideas came along when it came to processing data. The idea was that we should give users direct access to the data, instead of static reports created by complex programs. Self service was the idea (nothing new under the sun).
So, people started to create data structures that would support ad hoc queries for users. Here the early versions of Dimensional Modelling started to appear. Tables to be accessed directly for reporting were created, and SQL and RDBMSs grew more popular, because the “smartness” built into the optimizer in the RDBMS software made it easier to support ad hoc access to data structures.
These were of course created on the data that had been moved earlier to the independent copy platform. The copy system held the raw data from the different operational systems, and now people started to build/prepare data structures to support ad hoc querying. Still, these data structures were built for a specific reporting purpose and held only a limited, aggregated data set to ensure performance; no drill-down to details existed.
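The pattern described above can be sketched in a few lines. This is my own minimal illustration (the table and column names are invented, and SQLite stands in for the RDBMSs of the era): a purpose-built, aggregated query table is prepared from the raw copied data, and the detail rows are left behind, so no drill-down is possible from the query table itself.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# The raw data copied from an operational system.
con.execute("CREATE TABLE raw_orders (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("North", 100.0), ("North", 50.0), ("South", 75.0)],
)

# The prepared ad hoc query table: aggregated per region, no details.
con.execute(
    """CREATE TABLE sales_summary AS
       SELECT region, SUM(amount) AS total_sales
       FROM raw_orders GROUP BY region"""
)

for row in con.execute(
    "SELECT region, total_sales FROM sales_summary ORDER BY region"
):
    print(row)  # ('North', 150.0) then ('South', 75.0)
```

Users could now query `sales_summary` ad hoc, but any new question at a different granularity meant writing another purpose-specific preparation program.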
Still, a lot of problems existed in these environments:
- Only people with high cross functional skills (Business, Data, Programming) could build the data structures for ad hoc querying
- Different results on the same data, depending on who wrote the program
- Programs that created the ad hoc query tables were specific to each query table
- Only aggregated data was accessible
The problems were almost the same as before, but one thing had changed: the reports no longer always had to be static. Some ad hoc capabilities had been fulfilled.
Since the Copy Systems were very much an afterthought in the IT landscape, and there were no good architectural design principles for how to build and maintain these systems, they often had big problems handling changed query requirements, as well as problems when the underlying systems whose data was copied changed. These problems would escalate as more systems were copied and more query requirements were implemented.
The Many to Many Challenge
What, then, is the fundamental challenge of building these analytical systems? It is that there are two sides in the system (Ingestion and Access), and they have a many-to-many relationship between them.
- The Ingestion area must handle
  - Multiple information areas
  - Multiple IT systems
  - Multiple data models
  - Multiple local semantics
  - Multiple local identifications
- The Access area must handle
  - Multiple user groups
  - Multiple analytical/report requirements/use cases
  - Multiple granularities
Figure 2: Analytical System Architecture – Many to Many dependencies
This is a major difference between building an operational system and an analytical system. An operational system supports a specific business process. These processes have clear steps towards what they want to achieve. The input of data is clearly defined and the output is well defined. The number of use cases/scenarios the system should support is limited, and the user group of the system is well defined and limited. An analytical system, on the other hand, has many different data sources and many different user groups with many different requirements, supporting multiple different processes.
This means that the analytical system does not know all the different requirements/use cases it will have to support in its life cycle. So how do you design a system that does not know what requirements it is supposed to support?
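The scale of the many-to-many problem is easy to quantify. The sketch below is my own back-of-the-envelope illustration, not from the article: if every use case connects directly to every source, the number of dependencies grows multiplicatively, whereas routing both sides through a single decoupling point keeps growth additive.

```python
def direct_dependencies(n_sources: int, n_use_cases: int) -> int:
    """Point-to-point: every use case reads every source directly."""
    return n_sources * n_use_cases


def decoupled_dependencies(n_sources: int, n_use_cases: int) -> int:
    """Via one decoupling point: each source and use case connects once."""
    return n_sources + n_use_cases


# With 10 source systems and 20 use cases:
print(direct_dependencies(10, 20))     # 200 links to build and maintain
print(decoupled_dependencies(10, 20))  # 30 links
```

Every new source or use case added to the point-to-point design multiplies the maintenance burden, which is exactly why these systems stop scaling.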
In the early days (70’s, 80’s, 90’s) the architecture looked like this; or rather, the lack of architectural design patterns made it look like this.
Figure 3: Early Analytical System Architecture and design
As we can see, it is a mess of spaghetti lines with multiple dependencies on data and usage of data. Any time a new requirement/use case entered the analytical system, new lines of dependency between users and data sources were created. This was of course not maintainable or scalable over time as new requirements and systems came in.
Sadly, today we can see many analytical systems on new platforms built in exactly this way, and we are starting to see these systems fail from a maintainability and code scalability perspective.
What the Data Warehouse community learned in the 70’s, 80’s and 90’s has not carried over to the people building analytical systems on new technology platforms. People believe that the design principles for a Data Warehouse depend on technology and use cases, but they don’t. Remember, Data Warehouse architecture has nothing to do with technology; it is a logical design pattern to handle the challenges an analytical system must face.
The Associative Component
So how do we handle this many-to-many relationship between Ingestion and Access? We look at how it is done in a data model! We create an associative component to handle the many-to-many dependencies. That is exactly the conclusion that people like Barry Devlin, Bill Inmon and others came to in the beginning of the 90’s.
Figure 4: Analytical System Architecture – the associative component
Now, the definition of that associative component is the important part of how to build an analytical system. Bill Inmon gave us that definition in the early 90’s, and it still stands:
“a subject oriented, integrated, nonvolatile, time variant collection of data in support of management’s decisions”
In the next part of this blog series I will describe my view on what the definition means and why it is so important if you want to build an analytical system that will be maintainable and scalable over a long time frame.