Hi all,
Apologies in advance for my rant below.
So I’m one cog in a machine that is trying to build a data vault. The machine has been running for a while now and I’m coming in a bit blind. I’m doing my best to understand how I should model, but finding that I’m constantly changing my mind on the right model structure for the data I’m touching.
We are at the raw vault stage, modelling source by source, for a subset of source tables defined as being in scope.
Some of high level issues I’m struggling with are:
- DV is so much more than just ‘Hubs, Links and Satellites’ - all up I can see 8 DV entity types (Hubs, Links, Sats, MSats, MLinks, EffSats, NHLinks, SALinks)
- Most DV examples assume perfect data structures that just doesn’t matter reality - if I see another “sales, customers, orders”
- Source data varies so much from one source to the next
- What I’ve seen modelled to date in the organisation doesn’t look very consistent
- Going to AI models to help has been the opposite of help - keeps giving different answers, often when I give them no additional information but just remind them of a prior response
- There are constraints applied in the organisations modelling approach that means many responses / examples don’t apply (e.g. pre-processing, staging, splitting of tables etc.)
- But most of all I can’t see a simple decision tree of “when you have this situation, then model like that”
Maybe I’m asking too much, but I just don’t see how a modelling technique can be successfully implemented, at scale, by multiple people, with organisational constraints they can’t necessarily control, when there is so much subjectivity involved.
Is it a ‘true business concept’? I don’t know, you tell me. Go ask the business? Yeah cause they are chomping at the bit to talk data models….
Is this a genuine business key? Again, I don’t know. Who talks about order 12345? yeah a few people. Customer_ID 12345 not so many. DocumentationRequest 12345 hardly any.
How should I model this table? Well if it is this type of data, then do this, otherwise do that. Oh, unless it is a FK to another table, in which case do this instead. But also, don’t listen to person X, do this instead. No listen to me and do my way instead.
Or even better, well it depends on how the data is going to be used. Yeah so I should ask the 10000 potential users individually on what they will do with the data? And do they even know how they’ll use it?
Highly normalised data? good luck. Lots of NULLs? good luck. 20FKs in your table? here are the 55 different approaches you can take. 100+ data sources needing to be delivered within a reasonable programme type timeframe? very very very good luck.
But the best of all is, you want to know more? Buy my book, get certified, pay lots for some training, read my paywalled blog etc. etc.
I can generally pick things up reasonably quickly, but DV modelling has just stumped me. I am pretty methodological and like to follow a pattern / process / rinse and repeat type approach to things, and my brain feels like it is about to explode with all the research I’m doing without getting any closer that pattern / process / rinse and repeat approach.
And really if it isn’t simple, isn’t close to black and white, doesn’t result in outputs 80%+ similar not matter who is modelling, then why is it valuable?
Can’t we just do one Hub and one Sat per source table and be done with it?