Musings on Data Modelling
What is the fundamental purpose of Data Modelling? In my mind, it is about creating an understanding of how things work in the real world, in a specific context, and drawing a picture of it. That picture is the Data Model. To do that we need to define the things we are modelling, and the reason we define things, events and so on is to achieve an understanding of the world we model, in a specific context. So the main purpose of a data model is to achieve understanding: not just your own enlightenment, but a common understanding with the people who are active in the context we are modelling. One thing I have noticed over the years is that good definitions of what we are modelling are often lacking. Look at the model below, which shows part of a model for the context of Card Payment. What is the first question that pops up in your mind?
Maybe you check the cardinalities of the relationship? Maybe you check the attributes in the boxes? Maybe you wonder why there are no names on the lines that go to the relationship object (AccountCustomer)? But the main question that should pop up in your mind is: what are the definitions of the entities and the relationship?
One big issue we have in data modelling sessions is well-known “labels” (names of things). Because we each carry our own definition of the labels we are working with, we have our own understanding of what an Account and a Customer are. We can relate to them. Therefore it is easy to skip the step of defining the entities, since that might lead to a lot of discussion when we suddenly realize that the other people in the room have their very own definitions of the entities in question. But wait, wasn’t that the reason for holding this Data Model session in the first place: to create a common understanding?
Try this “mumbo jumbo” model instead:
What is the first question that pops up in your mind? Probably you now want to know what “Mumbo” and the other labels stand for, what their definitions are and so on, since you don’t have your own definition or understanding of those labels.
Now let’s return to the industry I work in, Analytical Systems design and implementation. A lot of the time we work with the goal of achieving a common representation of data from an Enterprise context. That means the Customer is defined not only for the Card Payment process but as what the Enterprise as a whole sees as a Customer. Now it starts to get really hard, which is probably why a lot of people in my industry would rather speak about the “technical” parameters of Data Modelling, such as whether the Data Modelling technique is agile or reusable, whether it lends itself to automatic implementation patterns, what normalization level it has, how well it supports query performance and so on, instead of how we create good definitions in a data modelling session. I include myself in that group of people; I do talk a lot about those “technical” Data Model things, but on the other hand, I really believe in the power of good definitions to achieve common understanding.
But let’s go back to the reason for definitions: to create a common understanding in the context you are modelling. How does that help us when implementing an Analytical System? Well, if you want to integrate data across system boundaries with a Subject Oriented view, rather than store data by system tables/files, you need the definitions. How could you otherwise know whether the data from a specific system should enter a given table/file in your Analytical System? So it is impossible to build a Subject Oriented Analytical System without definitions for the data model in the context (Enterprise, Division, Department etc.) whose data the Analytical System should represent. So why is it that, almost everywhere you go, the people who work with definitions are nonexistent or very few, when definitions are a primary prerequisite for building that kind of system? We have a multitude of architects, designers, data modelers (the technical kind), developers and so on, but very few people working on creating common definitions.
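To make the point about integration concrete, here is a minimal sketch in Python of how a written definition can act as the gate that decides whether a source record enters a Subject Oriented table. Everything in it is an assumption made up for illustration: the wording of the Customer definition, the `signed_agreements` field, and the two source systems (`cards` and `crm`) are hypothetical and not taken from any real model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EntityDefinition:
    """A written definition plus a testable rule derived from it."""
    name: str
    text: str
    qualifies: Callable[[dict], bool]


# Hypothetical enterprise-level definition of Customer (an assumption).
CUSTOMER = EntityDefinition(
    name="Customer",
    text="A party that holds at least one signed agreement with the enterprise.",
    qualifies=lambda rec: rec.get("signed_agreements", 0) >= 1,
)


def integrate(records, definition):
    """Split source records into those that meet the definition and those that do not."""
    accepted, rejected = [], []
    for rec in records:
        (accepted if definition.qualifies(rec) else rejected).append(rec)
    return accepted, rejected


# Hypothetical rows from two source systems.
card_system = [
    {"source": "cards", "id": "C1", "signed_agreements": 2},
    {"source": "cards", "id": "C2", "signed_agreements": 0},  # card applicant only
]
crm_system = [
    {"source": "crm", "id": "P7", "signed_agreements": 1},
    {"source": "crm", "id": "P8"},  # prospect, no agreement recorded
]

accepted, rejected = integrate(card_system + crm_system, CUSTOMER)
accepted_ids = [r["id"] for r in accepted]
rejected_ids = [r["id"] for r in rejected]
print(accepted_ids)
print(rejected_ids)
```

Without the definition there is no rule to write, and the only thing left to do is load every row from every system as it is. The hard part is agreeing on the text of the definition; the code is trivial once that is done.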
One reason I often hear when I talk to people in these projects is that they think it’s not their job: definitions of entities should already have been established, or someone else should write them. But then we end up in a waterfall approach, where we can’t load data into our Analytical System because we don’t have any definitions of the entities. That is one of the reasons we also see more and more of these source system copy systems: we don’t have time to wait for the definitions, so let’s dump the source data onto another machine and let the analysts figure out what the data is. That is the classic Copy System design; you can read about the issues with that design approach (independent of modelling technique) in this previous blog post.
In these days of agile development, where sprint teams should be self-sufficient, everyone needs to know how to create a good definition; otherwise they aren’t really self-sufficient.
When I check the Data Modelling/Big Data/Data Warehouse conferences that have recently been held or are coming up in the near future, there are no tracks or even single seminars on the importance of definitions and how to write a good definition in a data modelling session. It’s all about the “technical” aspects of data modelling. The courses on Analytical Systems are all about data modelling technique, performance, implementation patterns, software and hardware, not about the importance of definitions and how to create them.
I think it’s time for our industry to start educating ourselves in creating, writing and maintaining good definitions, and to raise the importance of that for building Analytical Systems. Seven years ago (time flies) I wrote a paper on Definitions and their use for Integration of data. If you are interested in a deeper explanation of my view on the use of Definitions for data integration, you can follow this link: ToM-Report-series Data Warehouse Defintions for Integration