Analytical Systems architecture and design patterns – Part 2

This is the second part on the design patterns of an analytical system. For more background, read the previous blog post –Link–

This blog series aims to explain how and why we need to build Analytical Systems in a specific way if we want them to be maintainable, scalable and long-lived.

The Data Warehouse definition – the definition of an Analytical System

In the previous blog post we looked at the history of building analytical systems and how, in the early 90’s, we came to the conclusion that we need an “Associative Component” in the architecture to handle the many-to-many relationships between source data and user groups’ requirements.

I also wrote that the associative component in the architecture has to follow the Data Warehouse definition that Bill Inmon gave us in the early 90’s:

“a subject oriented, integrated, nonvolatile, time variant collection of data in support of management’s decisions”

The definition can be interpreted in many ways. Here I will describe my view of what it means and why it is important from an Analytical System design perspective.

Subject Oriented

Subject Oriented, or Subject Area Oriented, is an important part of how an Analytical System is organized.

For the Associative Component in the Analytical System Architecture we organize and handle our data in subject areas. This way we always know where to get a certain kind of data, independent of its source. If we stored data by source instead, any question about a data concept that spans multiple sources would have to pick up data from many different places, and most queries in an Analytical System span multiple sources.

It might feel like a very small thing, but this is a fundamental part of designing and building an Analytical System. The main reason the subject area approach is so important is that it lets the analytical system break the dependency between the sources of data and the end user requirements. The data that enters an analytical system is not created by the analytical system itself; it depends on the underlying systems that feed it. Those underlying sources can and will change over time, and some will be removed from the IT landscape and replaced by new systems. If we build a direct dependency on specific source data and that source changes or is replaced, the effect on the analytical system is much broader than if we build it with a Subject Area oriented Associative Component.
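
To make the decoupling concrete, here is a minimal Python sketch. Every name in it (SubjectAreaStore, load_from_crm, load_from_billing, count_customers) and both source layouts are hypothetical and only illustrate the idea; they are not taken from any particular product. The analytical function reads from the Customer subject area only, so a source could be added or replaced without touching it.

```python
from collections import defaultdict


class SubjectAreaStore:
    """Toy Associative Component: data is filed by subject area, not by source."""

    def __init__(self):
        self._areas = defaultdict(list)

    def load(self, subject_area, records):
        self._areas[subject_area].extend(records)

    def read(self, subject_area):
        return list(self._areas[subject_area])


# Hypothetical source loaders. Each one knows its own source layout and
# translates it to the layout used in the Customer subject area.
def load_from_crm(store, rows):
    records = [{"customer_id": r["id"], "name": r["full_name"]} for r in rows]
    store.load("customer", records)


def load_from_billing(store, rows):
    records = [{"customer_id": r["cust_no"], "name": r["cust_name"]} for r in rows]
    store.load("customer", records)


# The analytical function only knows the subject area, never the sources.
def count_customers(store):
    return len({rec["customer_id"] for rec in store.read("customer")})


store = SubjectAreaStore()
load_from_crm(store, [{"id": "C1", "full_name": "Ann"},
                      {"id": "C2", "full_name": "Bo"}])
load_from_billing(store, [{"cust_no": "C2", "cust_name": "Bo"},
                          {"cust_no": "C3", "cust_name": "Cy"}])
print(count_customers(store))  # 3 distinct customers across both sources
```

If the billing system were replaced, only load_from_billing would need a new counterpart; count_customers would stay as it is.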

Integrated

Integrated can be seen as three different areas of integration: Structure (Subject Area), Instance and Semantic. That means we create a data representation in the Associative Component that has the same structure and semantic representation independent of source system.

I break it down into six levels. Each level is fundamental if the Associative Component is to hold the data in a format that removes the many-to-many dependency between Source and User, which is really important if you want to build an Analytical System that is maintainable, scalable and has a long lifespan.

  1. Sourcing: The fundamental part of an analytical system is that we need to get our hands on the data. This can be done in many ways, but we do need to have the ability to use data from various sources.
  2. Subject Area: This is also called co-location of data. We organize data from various sources according to what Subject Area it belongs to, not by each source.
  3. Instance: This is a very important part of integration. If a data instance within a Subject Area exists in multiple sources, the Associative Component has to be able to integrate the various sources and hold only one instance of it. For example, if a Customer exists in multiple systems and they all send data about that Customer, the Analytical System should have only one instance of that Customer, independent of how many sources have information about it.
  4. Attribute: The various sources of data often have their own naming conventions for attributes. The Associative Component, on the other hand, has to keep its data according to one naming convention. This makes it possible to access data independent of source system.
  5. Domain Values: Certain attributes have a fixed set of values, like a list of values that the attribute can use. Each source often has its own list of values for such attributes, and these must be translated as well. Otherwise you can’t really access the data independent of source.
  6. Format: We want the format of each data type to be the same, dates being one example. There are multiple ways of representing a date, but the Associative Component should have one.

If you handle data according to the definition of the Associative Component instead of by Source, you get much stronger scalability and maintainability and a longer life cycle for the Analytical System.

Let’s see how some different change scenarios affect an Integrated Associative Component versus a Source by Source architecture. (Of course, the effects described are simplified, but they represent the fundamental effects when a change happens.)
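
The six levels can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources ("crm" and "billing"), their attribute names, code lists and date formats are all made up, and instance integration is reduced to matching on a customer id that both sources are assumed to share. In a real system, matching instances across sources is usually much harder than this.

```python
from datetime import datetime

# Hypothetical per-source rules: attribute names, domain values and
# date formats differ between the sources.
SOURCE_RULES = {
    "crm": {
        "attributes": {"cust_id": "customer_id", "cust_type": "customer_type",
                       "created": "created_date"},
        "domain_values": {"customer_type": {"P": "PRIVATE", "C": "CORPORATE"}},
        "date_format": "%d/%m/%Y",
    },
    "billing": {
        "attributes": {"CustomerNo": "customer_id", "Segment": "customer_type",
                       "StartDate": "created_date"},
        "domain_values": {"customer_type": {"private": "PRIVATE",
                                            "company": "CORPORATE"}},
        "date_format": "%Y%m%d",
    },
}


def integrate(source, row):
    rules = SOURCE_RULES[source]
    # Level 4, Attribute: rename to the subject area's naming convention.
    record = {rules["attributes"][name]: value for name, value in row.items()}
    # Level 5, Domain Values: translate source codes to one shared list.
    for attribute, mapping in rules["domain_values"].items():
        record[attribute] = mapping[record[attribute]]
    # Level 6, Format: one date representation (ISO 8601 is an assumption).
    parsed = datetime.strptime(record["created_date"], rules["date_format"])
    record["created_date"] = parsed.date().isoformat()
    return record


def load_customer_subject_area(batches):
    # Level 2, Subject Area: one Customer store shared by all sources.
    customers = {}
    for source, rows in batches:  # Level 1, Sourcing: data arrives per source.
        for row in rows:
            record = integrate(source, row)
            # Level 3, Instance: one entry per customer. The first source to
            # deliver a customer wins here, which is a simplification.
            entry = customers.setdefault(record["customer_id"],
                                         {**record, "sources": []})
            entry["sources"].append(source)
    return customers


batches = [
    ("crm", [{"cust_id": "1001", "cust_type": "P", "created": "05/03/2021"}]),
    ("billing", [
        {"CustomerNo": "1001", "Segment": "private", "StartDate": "20210305"},
        {"CustomerNo": "1002", "Segment": "company", "StartDate": "20220110"},
    ]),
]
customers = load_customer_subject_area(batches)
for customer in customers.values():
    print(customer)
```

Customer 1001 arrives from both sources but ends up as a single instance with one set of attribute names, one code list and one date format, which is the point of the six levels.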

Base line – Eight systems handle product information that thirty different analytical functions have been coded against: reports, statistical models and AI scripts all use the data.

  1. Now we add a new source system of product data.  
    • Source by Source – All thirty analytical functions have to be rewritten to handle the new source. 
    • Associative Component – One system has to be mapped into the Product Subject Area and Integrated according to the six levels of Integration. Use Cases are untouched. 
  2. We want to create a new analytical function for products 
    • Source by Source – Nine different sources have to be mapped and understood with regard to the requirements of the analytical function. Then nine different sources of data have to be handled in the code, merged and integrated by the developer of the analytical function. The developer also needs a deep understanding of how each source system works, including the legacy in its data, to come up with a correct result.
    • Associative Component – The developer accesses the Product Subject Area and needs to translate the requirement into that subject area’s data definitions. The developer needs to understand the data model of the Product Subject Area but does not need to understand each and every source system.
  3. An old system gets replaced with a new one 
    • Source by Source – All thirty analytical functions have to be rewritten to handle the new source. 
    • Associative Component – One system has to be mapped into the Product Subject Area and Integrated according to the six levels of Integration. Use Cases are untouched. 
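
The scenarios can also be put into numbers. The Python sketch below counts "couplings", meaning places where something is coded or mapped against something else. It rests on the same simplification as the scenarios, namely that every analytical function uses every source; the function names are made up for the illustration.

```python
def source_by_source_couplings(sources, functions):
    # Every analytical function is coded directly against every source.
    return sources * functions


def associative_component_couplings(sources, functions):
    # Each source is mapped once into the subject area,
    # and each function reads once from the subject area.
    return sources + functions


sources, functions = 8, 30  # the base line

print(source_by_source_couplings(sources, functions))       # 240
print(associative_component_couplings(sources, functions))  # 38

# Scenario 1: a ninth source is added. Count the new couplings to build.
added_source_direct = (source_by_source_couplings(9, 30)
                       - source_by_source_couplings(8, 30))
added_source_assoc = (associative_component_couplings(9, 30)
                      - associative_component_couplings(8, 30))
print(added_source_direct, added_source_assoc)  # 30 1

# Scenario 2: a thirty-first function is added on top of nine sources.
added_function_direct = (source_by_source_couplings(9, 31)
                         - source_by_source_couplings(9, 30))
added_function_assoc = (associative_component_couplings(9, 31)
                        - associative_component_couplings(9, 30))
print(added_function_direct, added_function_assoc)  # 9 1
```

Under this simplification the Source by Source architecture grows with the product of sources and functions, while the Associative Component grows with their sum.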

These simple examples show the fundamental strength of the Associative Component when it comes to handling changes in the many-to-many relationships between the User and the source of data, and to simplifying the usage of data. This way the Analytical System can grow and change on both the Source side and the User side, and the Associative Component will help mitigate the effect of the change and in the long run create a more efficient development cycle. If you want to build an Analytical System that is scalable, maintainable and has a long life span, you need an Associative Component in your Analytical System architecture.