ToM Report Series on Data Warehouse – A Snowflake and Synapse comparison

This white paper is directed at decision makers and data architects who are considering a cloud data warehouse solution. It presents the results of a detailed comparison of two major products in this space, Snowflake and Microsoft Synapse (formerly known as Azure SQL Data Warehouse).

The comparison was done both through on-paper evaluations and hands-on testing of functionality, providing a strong argument for why Snowflake is currently (March 2020) the better option in most cases. The paper contains a grading matrix in which each evaluation area is summarized into an overall score for each solution.

 

Musings on Data Modelling

What is the fundamental purpose of Data Modelling? In my mind, it is about creating an understanding of how things work in the real world, in a specific context, and drawing a picture of it; that picture is the Data Model. To do that we need to define the things we are modelling, and the reason we define things, events and so on is to achieve an understanding of the world we model, in a specific context. So the main purpose of doing a data model is to achieve understanding, and not just your own enlightenment, but also a common understanding with the people that are active in the context we are modelling. One thing I have noticed over the years is that good definitions of what we are modelling are often lacking. Look at the model below, which shows part of a model for the context of Card Payment. What is the first question that pops up in your mind?

Maybe you check the cardinalities on the relationship? Maybe you check the attributes in the boxes? Maybe you wonder why there are no names on the lines that go to the relationship object (AccountCustomer). But the main question that should pop up in your mind is: what are the definitions of the entities and the relationship?

One big issue in data modelling sessions is well-known “labels” (names of things). Because we each carry our own definition of the labels we are working with, we have (our own) understanding of what an Account and a Customer is. We can relate to them. Therefore it is easy to skip the step of defining the entities, since that might end up in a lot of discussion when we suddenly realize that the other people in the room have their very own definitions of the entities in question. But wait, wasn’t that the reason we should do this Data Model session in the first place: to create common understanding?

Try this “mumbo jumbo” model instead

What is the first question that pops up in your mind? Probably, you now want to know what “Mumbo” etc. stands for, what its definition is and so on, since you don’t have a definition or understanding of your own for that label.

Now let’s return to the industry I work in: Analytical systems design and implementation. A lot of the time we work with the goal of achieving a common representation of data in an Enterprise context; that means, for example, a Customer defined not only for the Card Payment process but as the Enterprise as a whole sees a Customer. Now it starts to get really hard, and that is probably why a lot of people in my industry would rather speak about the “technical” parameters of Data Modelling: whether the Data Modelling technique is agile and reusable, whether it lends itself to automatic implementation patterns, what normalization level it has, how well it supports query performance and so on, instead of how we create good definitions in a data modelling session. I include myself in that group of people; I do talk a lot about those Data Model “technical” things, but on the other hand, I really believe in the power of good definitions to achieve common understanding.

But let’s go back to the reason for definitions: to create a common understanding in the context you are modelling. How does that help us when implementing an Analytical System? Well, if you want to integrate data over system boundaries with a Subject Oriented view, and not store data by system tables/files, you need the definitions. How could you otherwise know whether the data from a specific system should enter a given table/file in your Analytical System? It is impossible to build a Subject Oriented Analytical System without definitions of the data model in the context (Enterprise, Division, Department etc.) whose data the Analytical System should represent. So why is it that, almost everywhere you go, the people who work with definitions are non-existent or very few, when they are a primary prerequisite for building that kind of system? We have a multitude of architects, designers, data modelers (the technical kind), developers and so on, but very few people working on creating common definitions.

One reason I often hear when I talk to people in these projects is that they think it is not their job; definitions of entities should already have been established, or someone else should do it. But then we end up in a waterfall approach, where we can’t load data into our Analytical System, since we don’t have any definitions of entities. That is one of the reasons we also see more and more of these source system copy systems: we don’t have the time to wait for the definitions of entities, so let’s dump the source data into another machine and let the analysts figure out what the data is. This is the classic Copy System design, and you can read about the issues (independent of modelling technique) with that design approach in this previous blog post.

In these days of agile development, where sprint teams should be self-sufficient, everyone needs to know how to create a good definition; otherwise they aren’t really self-sufficient.

When I check the latest Data Modelling/Big Data/Data Warehouse conferences, recent and upcoming, there are no tracks or even single seminars on the importance of definitions and how to write a good definition in a data modelling session. It’s all about the “technical” aspects of data modelling. The courses on Analytical Systems are all about data modelling technique, performance, implementation patterns, software and hardware, not about the importance of definitions and how to create them.

I think it’s time for our industry to start educating ourselves in creating, writing and maintaining good definitions, and to lift the importance of that for building Analytical Systems. Seven years ago (time flies) I wrote a paper on Definitions and their use for the Integration of data. If you are interested in a deeper explanation of my view on the use of Definitions for data integration, you can follow this link – ToM-Report-series Data Warehouse Defintions for Integration

 

Analytical Systems architecture and design patterns – Part 2

This is the second part of the design pattern of an analytical system, for more background read the previous blog post –Link–

The blog series aims to explain how and why we need to build Analytical Systems in a specific way if we want them to have maintainability, scalability and longevity.

The Data Warehouse definition – the definition of an Analytical System

In the previous blog post we looked at the history of building analytical systems and how, in the early 90’s, we came to the conclusion that we need an “Associative Component” in the architecture to handle the many-to-many relationships between source data and user groups’ requirements.

I also wrote that the associative component in the architecture has to follow the Data Warehouse definition that Bill Inmon gave us in the early 90’s:

“a subject oriented, integrated, nonvolatile, time variant collection of data in support of management’s decisions”

Now, the definition can be interpreted in many ways. I will describe here my view of what this definition means and why it is important from an Analytical System design perspective.

Subject Oriented

Subject Oriented, or Subject Area Oriented, is an important part of how an Analytical System is organized.

For the Associative Component in the Analytical System Architecture we organize and handle our data in subject areas. This way we always know where to get certain kinds of data, independent of the source of the data. If we stored data according to the source of the data, we would end up with many different places to pick up data whenever a question spans multiple sources for a specific data concept, which most queries in an Analytical System do.

It might feel like a very small thing, but this is a fundamental part of designing and building an Analytical System. The main reason why the subject area approach is so important is that it gives the analytical system the ability to disconnect the dependency between the source of data and the end user requirements. The sources of data that enter an analytical system are not created by the analytical system itself; it is dependent on the underlying systems that feed it data. The underlying sources of data can and will change over time, and may even be removed from the IT landscape and replaced by new systems. If you build a direct dependency on specific source data and they change or are replaced, the effect on the analytical system will be much broader than if you build it with a Subject Area Associative Component.

Integrated

Integrated can be seen as three different areas of integration: Structure (Subject Area), Instance and Semantics. That means we create a data representation in the Associative Component that has the same structural and semantic representation independent of source system.

I break it down into six levels, where each level is fundamental if the Associative Component is to hold the data in a format that removes the many-to-many dependency between Source and User, which is really important if you want to build an Analytical System that is maintainable, scalable and has a long lifespan.

  1. Sourcing: The fundamental part of an analytical system is that we need to get our hands on the data. This can be done in many ways, but we do need the ability to use data from various sources.
  2. Subject Area: This is also called co-location of data. We organize data from various sources according to what Subject Area it belongs to, not by each source.
  3. Instance: This is a very important part of integration: if a data instance within a Subject Area exists in multiple sources, the Associative Component in the analytical system has to have the ability to integrate the various sources and keep only one instance of it. For example, if a Customer exists in multiple systems and they all send data about that Customer, the Analytical System should have only one instance of that Customer, independent of how many sources hold information about it.
  4. Attribute: The various sources of data often have their own naming conventions for attributes. The Associative Component, on the other hand, has to keep its data according to one naming convention. This makes it possible to access data independently of source system.
  5. Domain Values: Certain attributes have stable domains, like a list of values that the attribute can take. The sources often have their own lists of values for these attributes, which must be translated as well; otherwise you can’t really access the data independently of source.
  6. Format: We want the format for each data type to be the same, for example the date format. There are multiple ways of representing a date, but the Associative Component should have one.
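As a rough illustration of levels 3 to 6, here is a minimal Python sketch of translating records from two hypothetical sources into one common Customer representation: attribute names are mapped to one convention, domain values are translated, dates are normalized, and duplicate instances are collapsed on the business key. All source names, fields and mappings are invented for the example.

```python
from datetime import datetime

# Hypothetical per-source metadata: attribute renames, domain-value
# translations and date formats (levels 4, 5 and 6 above). All names
# are illustrative, not taken from any real system.
SOURCE_RULES = {
    "crm":     {"rename": {"cust_no": "customer_id", "tel": "phone"},
                "gender": {"M": "Male", "F": "Female"},
                "date_format": "%d/%m/%Y"},
    "billing": {"rename": {"custid": "customer_id", "phone_nr": "phone"},
                "gender": {"1": "Male", "2": "Female"},
                "date_format": "%Y%m%d"},
}

def to_subject_area(source, record):
    """Translate one source record to the common Customer representation."""
    rules = SOURCE_RULES[source]
    out = {rules["rename"].get(k, k): v for k, v in record.items()}  # attribute names
    out["gender"] = rules["gender"][out["gender"]]                   # domain values
    out["since"] = datetime.strptime(                                # one date format
        out["since"], rules["date_format"]).date().isoformat()
    return out

def integrate(records_by_source):
    """Instance integration (level 3): one row per customer_id,
    however many sources hold data about that customer."""
    customers = {}
    for source, records in records_by_source.items():
        for rec in records:
            row = to_subject_area(source, rec)
            customers.setdefault(row["customer_id"], row)  # keep one instance
    return customers

rows = integrate({
    "crm":     [{"cust_no": "42", "gender": "F", "since": "01/03/2020", "tel": "555-1"}],
    "billing": [{"custid": "42", "gender": "2", "since": "20200301", "phone_nr": "555-1"}],
})
print(rows)  # one Customer instance for id "42"
```

In a real implementation the "keep one instance" step would of course involve survivorship rules rather than first-in-wins, but the principle is the same: the rules live in metadata per source, while the Subject Area representation stays one and the same.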

If you handle data according to the definition of the Associative Component instead of by source, you get much stronger scalability and maintainability, and a longer life cycle for the Analytical System.

Let’s see how some different change scenarios affect an Integrated Associative Component versus a Source-by-Source architecture. (These are, of course, simplified effects, but they represent the fundamental effects of a change.)

Base line – Eight systems handle product information, against which thirty different analytical functions have been coded; there are reports, statistical models and AI scripts using the data.

  1. Now we add a new source system of product data.  
    • Source by Source – All thirty analytical functions have to be rewritten to handle the new source. 
    • Associative Component – One system has to be mapped into the Product Subject Area and Integrated according to the six levels of Integration. Use Cases are untouched. 
  2. We want to create a new analytical function for products 
    • Source by Source – Nine different sources have to be mapped and understood in regard to the requirements of the analytical function. Then nine different sources of data have to be handled in the code, merged and integrated by the developer of the analytical function. The developer also needs a deep understanding of how each source system works, and all the legacy in its data, to come up with a correct result. 
    • Associative Component – The developer accesses the Product Subject Area and needs to translate the requirement to that subject area’s data definitions. The developer needs to understand the data model of the Product Subject Area, but does not need to understand each and every source system. 
  3. An old system gets replaced with a new one 
    • Source by Source – All thirty analytical functions have to be rewritten to handle the new source. 
    • Associative Component – One system has to be mapped into the Product Subject Area and Integrated according to the six levels of Integration. Use Cases are untouched. 

These simple examples show the fundamental strength of the Associative Component when it comes to handling changes in the many-to-many relationships between the User and the source of data, and simplifying the usage of data. This way the Analytical System can grow and change both on the Source side and the User side, and the Associative Component will help mitigate the effects of change and, in the long run, create a more efficient development cycle. If you want to build an Analytical System that is scalable, maintainable and has a long life span, you need an Associative Component in your Analytical System architecture.

 
ToM Focal – Introduction to Focal implementation

The Focal pattern was first implemented as a Data Warehouse in 1998. That Data Warehouse is still running on the same base code today. Over the years the Focal pattern has been used and improved in many different implementations.

Today at the Top of Minds Focal company, we work on the continuation of the Focal Framework and teach others how to use it.

This document gives an introduction to how a Focal Data Warehouse works, both from a modelling perspective, but as well from an implementation perspective. All examples are simplified for ease of understanding.

 
ToM Report Series on Data Warehouse – Removing duplicates using Informatica Powercenter

There is more than one way to skin a cat, and there are four ways to choose one distinct row from several candidates in an Informatica PowerCenter mapping: Rank, Sort, Aggregate and the sorted Expression-filter technique. Each method has its own pros and cons, and a good developer should know when to use which technique. This white paper details the four methods and discusses when each one of them should be used. Ideally, you want more than one possible solution for each scenario you encounter, and you determine which technique to use by comparing run time, memory usage and other parameters of importance to your configuration.
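The sorted Expression-filter idea can be sketched outside PowerCenter as well: sort the rows by key (plus a preference order), then keep a row only when its key differs from the previous row's key. A minimal Python analogue, with made-up data and field names:

```python
# Sketch of the "sorted Expression-filter" deduplication pattern:
# a sort step orders rows by key and preference, an expression step
# remembers the previous key, and a filter step keeps a row only
# when its key changes. Data and fields are invented for illustration.
def keep_first_per_key(rows, key, prefer):
    rows = sorted(rows, key=lambda r: (r[key], r[prefer]))  # the Sorter step
    previous, kept = object(), []                           # sentinel: nothing seen yet
    for row in rows:
        if row[key] != previous:                            # Expression + Filter step
            kept.append(row)
        previous = row[key]
    return kept

rows = [
    {"id": 1, "rank": 2, "value": "b"},
    {"id": 1, "rank": 1, "value": "a"},
    {"id": 2, "rank": 1, "value": "c"},
]
kept = keep_first_per_key(rows, key="id", prefer="rank")
print(kept)  # the best-ranked row per id
```

The same result can be reached by grouping (the Aggregate approach) or by ranking within each key (the Rank approach); which one wins depends on data volumes and cache sizes, which is exactly the trade-off the paper explores.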

 
ToM Report Series on Data Warehouse – Write generic code to limit codebase and gain performance in Informatica PowerCenter

Ever-changing business needs and requirements are forcing BI projects to be more agile than ever before. The value of “seeing the data” can’t be emphasized enough in a BI project, which is why too-rigid implementation processes and long delivery sprints should be avoided. Data Vault is one of the most flexible modelling techniques in use today, much thanks to its ability to cope and adapt with fast-changing realities and the increasing need to extend existing data models.

When working with Data Vault or other ensemble modelling techniques, the different building blocks (Hubs, Links and, to some extent, Satellites) are well defined and all share a common structure that can be used to build generic code. With the use of generic code, the code base can be kept to a minimum, and hence the time to value can be shortened. One should be aware that a downside of generic code can be greater complexity and reduced data load speed if you aren’t careful in the design and implementation.

This report gives the reader hands-on design examples in Informatica PowerCenter, to be used for writing generic, agile code without losing performance in the data load. Because of its suitability, Data Vault is chosen to exemplify the performance gain.
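The reason generic code works so well for Data Vault is that every Hub shares one shape: a hash key, a business key, a load timestamp and a record source. As a tool-agnostic illustration of that idea (not the report's PowerCenter design; all column names here are invented), a generic hub load can be sketched in a few lines of Python:

```python
import hashlib

# Every Hub has the same columns, so one load routine can serve all of
# them: hash the business key, insert only keys the hub has not seen.
# Column names and the md5-of-upper-trimmed-key convention are assumptions
# for this sketch, not taken from the report.
def hub_hash(business_key):
    return hashlib.md5(business_key.strip().upper().encode()).hexdigest()

def load_hub(existing_keys, staged_business_keys, record_source, load_ts):
    """Generic hub load: insert only business keys not already in the hub."""
    new_rows = []
    for bk in staged_business_keys:
        hk = hub_hash(bk)
        if hk not in existing_keys:            # the "new key only" filter
            existing_keys.add(hk)
            new_rows.append({"hub_hk": hk, "bk": bk,
                             "load_ts": load_ts, "rec_src": record_source})
    return new_rows

hub = set()  # stands in for the hub table's existing hash keys
first = load_hub(hub, ["C-100", "C-200"], "crm", "2020-03-01")
second = load_hub(hub, ["C-200", "C-300"], "billing", "2020-03-02")
print(len(first), len(second))  # 2 1 (C-200 already exists on the second load)
```

One routine, driven by metadata about which staging column is the business key, can then load every hub in the model; that is the code-base reduction the report is after, realized with PowerCenter mapplets and reusable transformations rather than Python.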

 
ToM Report Series on Data Warehouse – Using Informatica Powercenter to automate your Data vault

With the ongoing success and sustained rapid growth of the Data Vault modelling standard, the need for robust and easy-to-use code for loading and extracting data is ever increasing. However, many companies struggle with non-reusable ETL solutions that cost too much to maintain and don’t scale effectively. By the end of this white paper, you will know how and why Informatica PowerCenter is an excellent tool for developing a standardized set of reusable ETL code, ensuring a stable, scalable data warehouse solution that is easy to maintain over time.

 
ToM Report Series on Information Integration – Hyper Agile Design Pattern

In the Report “ToM Report Series on Information Integration – IT Support of Business Agility” we looked at the idea of building flexible information integration solutions, on a high level, that could

  • Quickly adapt to today’s fast-moving business environment
  • “Absorb” new business information requirements
  • Give the business users the power of how the information should be translated into a common business information view
  • Set up rules for how the information should be used by business processes.

This report will focus on how to design a Hyper Agile Data Warehouse.
We will look at different design patterns and how to combine them to create a Hyper Agile Design Pattern. This pattern can be used for many different information integration solutions, Master Data, Data Warehouse, Operational Information Integration etc.