DV Modelling Is Hard

Hi all,

Apologies in advance for my rant below.

So I’m one cog in a machine that is trying to build a data vault. The machine has been running for a while now and I’m coming in a bit blind. I’m doing my best to understand how I should model, but finding that I’m constantly changing my mind on the right model structure for the data I’m touching.

We are at the raw vault stage, modelling source by source, for a subset of source tables defined as being in scope.

Some of the high-level issues I’m struggling with are:

  • DV is so much more than just ‘Hubs, Links and Satellites’ - all up I can see 8 DV entity types (Hubs, Links, Sats, MSats, MLinks, EffSats, NHLinks, SALinks)
  • Most DV examples assume perfect data structures that just don’t match reality - if I see another “sales, customers, orders” example…
  • Source data varies so much from one source to the next
  • What I’ve seen modelled to date in the organisation doesn’t look very consistent
  • Going to AI models for help has been the opposite of helpful - they keep giving different answers, often when I give them no additional information and just remind them of a prior response
  • There are constraints in the organisation’s modelling approach that mean many responses / examples don’t apply (e.g. pre-processing, staging, splitting of tables etc.)
  • But most of all I can’t see a simple decision tree of “when you have this situation, then model like that”

Maybe I’m asking too much, but I just don’t see how a modelling technique can be successfully implemented, at scale, by multiple people, with organisational constraints they can’t necessarily control, when there is so much subjectivity involved.

Is it a ‘true business concept’? I don’t know, you tell me. Go ask the business? Yeah cause they are chomping at the bit to talk data models….

Is this a genuine business key? Again, I don’t know. Who talks about order 12345? yeah a few people. Customer_ID 12345 not so many. DocumentationRequest 12345 hardly any.

How should I model this table? Well if it is this type of data, then do this, otherwise do that. Oh, unless it is a FK to another table, in which case do this instead. But also, don’t listen to person X, do this instead. No listen to me and do my way instead.

Or even better, well it depends on how the data is going to be used. Yeah so I should ask the 10000 potential users individually on what they will do with the data? And do they even know how they’ll use it?

Highly normalised data? good luck. Lots of NULLs? good luck. 20FKs in your table? here are the 55 different approaches you can take. 100+ data sources needing to be delivered within a reasonable programme type timeframe? very very very good luck.

But the best of all is, you want to know more? Buy my book, get certified, pay lots for some training, read my paywalled blog etc. etc.

I can generally pick things up reasonably quickly, but DV modelling has just stumped me. I am pretty methodical and like to follow a pattern / process / rinse-and-repeat type approach to things, and my brain feels like it is about to explode with all the research I’m doing without getting any closer to that pattern / process / rinse-and-repeat approach.

And really, if it isn’t simple, isn’t close to black and white, and doesn’t result in outputs 80%+ similar no matter who is modelling, then why is it valuable?

Can’t we just do one Hub and one Sat per source table and be done with it?

2 Likes

Hey JustOneCog!

I love this post, thank you for the rant. For many people in the DV2.0 space this is a question they’ve heard to death, but people forget that this is the genuine frustration people have with Data Vault at the end of the day. I appreciate you outlining your specific ills with the process so it’s not just a snowball fight.

I’ll try to give my two cents but you’ve covered a lot of ground here so please excuse my picking and choosing from the topics you’ve mentioned.

“sales, customers, orders” & Source data variety

I find a lot of your high-level issues are less about Data Vault and more about data in general. Conforming sources is hard when all the data is completely different from one another. Unclean data with errors and gaps is all too common in the industry, and DV2.0 deals with this about as well as any other architecture. We can’t make data out of thin air!

This is obviously simplified when people talk about perfect data in sales/customers/orders. But the fundamental reason these things are frustrating is that there are many steps before you start planning out your tables.

To be frank (which I am) Data Vault front-loads the pain which has been awful for its reputation :sweat_smile:. These are problems you’ll hit down the line with any architecture. If you’re building a dimensional model and your dim_customers table has many sources with completely different data in them and subtly different definitions of what a customer is then you’re going to have to reconcile this somehow. You’re going to need to ask the business what they care about to include in it and find solutions to the same problems that DV asks you to solve ahead of time.

Go ask the business? Yeah cause they are chomping at the bit to talk data models….

True! Communication is the trickiest part of the job. I take it you’re exaggerating, as people are normally quite excited to talk about their work and how it functions behind the scenes. Talking about data models is naff, but that’s why DV tries to mostly talk about business function and then translate that into a data model. If anything, it makes it easier than other methodologies to build a map of the key business areas and how they interact.

We are at the raw vault stage

This is a nitpick, so ignore it if you fancy. Ideally you’d build out one source end to end and then move on to the next. It helps to provide bitesize value to the business and start iteration earlier. As you say, though, there are a lot of cogs in this machine and decisions like this are sometimes out of our hands.

Buy my book, get certified, pay lots for some training, read my paywalled blog etc. etc.

Dude, completely agree. I’m sure there’s a reason for it - DV is a tricky subject and those who understand it hold it as a trade secret. But DV as a concept would do so much better with some more freely available resources. You can buy my book on this for more info (joke). But a small ad, if you’re interested: I think @neil.strange still runs a free intro course called DV Core Concepts.

8 DV entity types

This is true, depending on your perspective. People often sell DV on this “big 3” simplicity, but the others are really descriptive flavours of the same idea. Hubs, links and satellites tell you how the hub-and-spoke structure works in DV, but not the specifics of the data being stored there. MSats are just sats where you allow a child key to discern multiple values; SALinks are just links that translate from one context to another. They’re all minor variations on the original approach. A few of them share the exact same code, and it’s just the context they’re used in that helps modellers understand the function of each table.

That being said I hear your concern. If you go in with the big 3 at heart then you’ll normally stumble into the others when you’re trying to apply your data to the business model. That’s the experience I had at least.
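To make the “same code, different context” point concrete, here’s a tiny illustrative Python sketch (all table and column names are invented for illustration - a real implementation would be SQL against your staging layer): the only structural difference between a plain sat and a multi-active sat is whether a child key is part of the grain.

```python
import hashlib

def hash_key(*business_keys):
    """Deterministic surrogate hash key from one or more business keys."""
    normalised = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalised.encode()).hexdigest()

# Plain satellite: grain = hub key, so one active row per customer.
sat_row = {
    "customer_hk": hash_key("CUST-12345"),
    "name": "Acme Ltd",
}

# Multi-active satellite: same shape, but a child key ("phone_type")
# is part of the grain, so several rows can be active per customer.
msat_rows = [
    {"customer_hk": hash_key("CUST-12345"), "phone_type": "mobile", "number": "555-0100"},
    {"customer_hk": hash_key("CUST-12345"), "phone_type": "office", "number": "555-0199"},
]
```

Same hashing, same load logic - only the grain of the key changes between the two flavours.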

it isn’t simple, isn’t black and white, not 80%+ similar. Why is it valuable?

All software architecture decisions have a trade-off, and not all projects are served well by a Data Vault solution. Black & white and 80%+ similar is a matter of implementation, so I’ll leave that out. But so many people struggle with DV that it’s hard to call it simple, so if simplicity is one of the business’ core requirements then they may be better served by another warehouse architecture. It seems, though, that you’re dealing with large amounts of data from many differing sources, conformed into a unified view of the business. This is DV’s bread and butter. Some projects value the robustness, expandability and auditability of their warehouse over its simplicity. Just a matter of priority, really.

Can’t we just do one Hub and one Sat per source table and be done with it?

I wish! I think this architecture is called “Your source system but worse”. It’d work but would serve no benefit over just cloning the source system in raw.

That’s it, I think my rant-in-kind is over. Missed a couple of the key bits but I think I’ve forced everyone to read plenty. Shout if there’s more you’d like to discuss.

Frankie

1 Like

First of all, welcome to the community and thank you for your open, honest post. It’s great to see :slight_smile:

Let’s take a step back here and remind ourselves what the point of all this is, and hopefully we can help you find a path forward - there’s a lot to unpack here though so apologies if I miss anything.

Let’s get one thing out of the way early: Modelling is hard, it requires thought (for better or worse!)

This inevitably means you will not get it right first time, and that’s OK. Data Vault itself has many agile principles built into it and I could spend a while here talking about all of that, but let’s focus on the important bit in this context: you can build, iterate, see if it works, realise how wrong you were and go around the loop again - do not be afraid to get it wrong and improve it over time.

Next, we should remember that Data Vault is built on business-first delivery. Hubs, Links, Satellites and the whole DV Structure Gang are actually secondary - merely the engine/building blocks to achieve the outcome. When modelling, you don’t go straight to these structures; you must understand the business processes first - this means Concept Modelling, Process Modelling and the rest.

To this end, your conversations with the business don’t actually need to look like modelling conversations; they can be an open discussion about how the business works, what someone’s job looks like (if applicable to the model), how Entity A is processed and reaches State X, etc.

Then we have Top-down and Bottom-up modelling. If you start Top-down with what the end user wants to achieve - not all 10,000 of them, hopefully just a few key stakeholders if you can access them (I appreciate this is a whole other story in some orgs, though!) - then this gives you a target to aim for.

This then gives you a direction on what source data you may need, but don’t start with everything - start small. Bring in just enough tables to pilot your data model and check that it enables the end result you’ve discussed with your users.

To your point about having a simple decision tree to model x like y in a given situation: I hear you. The golden rule to remember here is making sure you can answer “Yes” to the following question: “Does my model model the business process in a way the business understands it?”

If you can, then this is enough to get started. Over time, experienced modellers will have a standard way to model certain things, and know enough to understand when to deviate or alter the standard approach for the quirks or unique processes of a particular business. If you’re not at this stage yet: don’t be afraid to give it a go and iterate on it, as I said earlier.

One final thing to remember is that the situation you are in, and the way you are feeling, is by no means unique to Data Vault. Most of this would be happening with or without Data Vault. The bit Data Vault does give you is a standard, pattern-based structure to implement your data model once you’ve figured out what that is - creating your physical hubs, links, satellites and the rest is actually the easy bit relative to the bigger picture.
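That “standard, pattern-based structure” is worth seeing in miniature. A hypothetical, simplified Python sketch (function and column names invented; real vaults would do this in SQL or a generation tool) of the insert-only hub load pattern - the same logic is reused for every hub, whatever the source:

```python
import hashlib
from datetime import date

def load_hub(hub, incoming_keys, record_source, load_date=None):
    """Insert-only hub load: add only business keys not yet present.
    The same pattern applies to every hub regardless of source."""
    load_date = load_date or date.today().isoformat()
    existing = {row["hub_hk"] for row in hub}
    for bk in incoming_keys:
        hk = hashlib.md5(str(bk).strip().upper().encode()).hexdigest()
        if hk not in existing:
            hub.append({"hub_hk": hk, "business_key": bk,
                        "load_date": load_date, "record_source": record_source})
            existing.add(hk)
    return hub

hub_customer = []
load_hub(hub_customer, ["CUST-1", "CUST-2"], "crm")
load_hub(hub_customer, ["CUST-2", "CUST-3"], "billing")  # CUST-2 already present, so skipped
```

The second load reuses the same hub - keys already seen from the first source are not duplicated, which is exactly why the structural part is the repeatable, “easy” bit.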

I will leave it there for now, as I feel like I’ve rambled a bit - but hopefully this gives you some insight on the reality of Data Modelling: Business-first but perfectly imperfect.

1 Like

First of all, thank you to both Frankie and Alex for taking the time to respond in detail to my rant. Rather than matching my frustrated / annoyed tone you calmly responded with a “we understand” type of reply which I do appreciate. My post was probably more a therapeutic exercise than actively seeking a response :).

Secondly thanks for recognising that the ‘ideal DV’ in theory isn’t what always pans out in practice. I feel that many responses that effectively say “you are doing it wrong” don’t understand that modellers generally don’t have the authority to make wholesale changes to standards and practices that their organisations employ, nor should they.

My original rant was my frustrations getting to me as I slowly felt like I was getting the hang of things until doubt crept over me again and I was looking at yet another revision.

I think that the two parts I’ve really been missing / struggling to find in the DV research I’ve been doing are:

  • The “why” for both DV in general and for doing things a certain way. Why is it better? Why does it matter? What is the impact?
  • Must haves vs. nice to haves. Very much linked to the ‘why’, but if modelling is as much an art as a science, and it is very unlikely two modellers will produce an identical design, then it would be nice to know the fundamentals vs. ancillary modelling decisions / impacts. A lot of materials & forum posts act like there is only one way, and it’s their way.

P.S. The one hub & sat wasn’t implying a unique hub for every table. It was just saying that rather than deciding between 8 entities, it would be far easier to default to 2, and of course reusing hubs as much as possible. But again, if this approach was taken, what would be the cost / impact vs. doing ‘ideal’ DV?

1 Like

I think the “why” is that most core business databases evolve wrongly and miss building a slim semantic layer (a subset of curated tables for core business entities), so data warehouses, instead of pulling what should already have been done properly by the OLTPs, end up struggling to fix both the model and the data.

Hi John! For me one thing seems sure: you feel left alone with that (DV|DWH|BigData|BI|AI) mess… And I am pretty sure almost every honest practitioner in this forum has experienced that same feeling (I am feeling the same at the moment). You won’t compensate for the lack of sponsorship from the upper spheres (“That’s the IT, stupid!”) or involvement from the lower spheres (who even asked them about the data they are using, defining or producing every single day?).

So you have to connect the dots by yourself…

I hope I won’t be banned from the forum for saying this:

DV is a great thing, “for sure” (I’m French :smiling_face_with_sunglasses: ). But IMHO you should never have to write SQL by hand for the DV artifacts. Take a step back (or to the left, if you dare, like in the Rocky Horror Picture Show). How are your source datasets mapping to the business concepts? What are the different keysets? Which ones are conformed across the whole enterprise? Which ones have to be integrated? I found the work of Andrew Foad really interesting: HOOK. Take a look at his work. It’s exactly the step to the left I just mentioned. DV is not the goal, it’s the vehicle for your data journey!

Best regards from French west coast !

Fabrice

Thanks John, my feelings are very similar and I am glad that I am not alone on this Data Vault journey.

I’ve been studying this architecture for 10 months now and it is the third time I am rebuilding most of my Data Vault because each time I thought I was doing it right I was completely wrong.

Information on this topic is sparse and hard to understand completely. No one knows a clear way to do this, and even asking AI for help led me to big mistakes. Maybe now that I am following the course by Dan Linstedt (creator of DV) to get certified I’ll be able to understand a little more.

I am sorry that I did not add any additional value to the great replies you already got, I wrote this because I needed to rant a little too and took the chance, thank you.

Do not hesitate to ask on this forum when you are feeling stuck; there are great people here who are ready to clarify your doubts and help you (talking from personal experience).

Best regards,

Alberto

1 Like