Business Key Identification

Hi all,

A lot of data vault literature talks about taking the time to properly identify business keys.

But not too many examples are given, apart from VIN which is mentioned over and over again.

I get that government identifiers are useful and often represent a good business key, however most business concepts / hubs will not have a nice government identifier available and consistently populated.

Dan states the importance of business keys in his source system vault article: Source System #datavault and Business Keys - DataVaultAlliance . He says don’t use surrogate keys as they are generated by the system. He then of course mentions VIN as an example. But then in a real life scenario he mentions he used contract number and tail number - but these are one off events, contracts aren’t like customers or staff members, so aren’t that helpful as an example. And how practical is it for an organisation to accurately capture something like contract_number in every source system that refers to it? Wouldn’t it often be manually entered and hence not 100% reliable? (He also mentions that he had 125 source systems - does he mean systems or objects / tables - resulting in 75 data vault objects. I don’t understand how DV can result in LESS objects than the source tables, unless there is significant preprocessing to denormalise the data)

Another example of a business key is often a customer name - but isn’t that dependent on how it is entered into a system? Isn’t it the case that some systems split out names into firstname and lastname and others don’t? Wouldn’t human entry of names into systems sometimes be different (e.g. John vs. Jon vs. Johnny), even if it is the same customer?

If the answer is “get the source to populate it properly”, what if the source is contact centre staff? How would you ensure they populate it properly and type in John instead of Jon when it is all manual entry?

And in terms of surrogates being lost during a system migration, wouldn’t a system migration / replacement store the legacy system’s ID as a field in the new system, allowing a same as link to be applied?

Most info on keys say to use a surrogate key if nothing else exists / as a last resort. I would be interested to know how often this last resort is used in practice?

And finally, would be great if people could give examples of good business keys for:

  • Sale
  • Address
  • Asset
  • Store
  • Transaction
  • Template
  • Model
  • Forecast

(assuming these are all ‘first class’ concepts and managed by an organisation in their own right). Or will the answer inevitably be - ‘depends on the organisation’? If this is the case, how does a business key stand the test of time? Yes a modeller might have done multiple iterations with the ‘business’ (btw who / what is ‘the business’? frontline sales? customer support? marketing analytics?), but 3 years down the track would the latest modeller and new staff members of the business agree with the original decision? (note: answer could be a massive yes, generally they do, just want to understand the thought process behind business key identification)

Thanks for your help

1 Like

You don’t always get a perfect business key in real life, so i try to find fields that meet as many of these characteristics as possible:

  • Does a domain expert recognise this field and use it in their day to day language?

  • Is it already used across multiple parts of the organisation to refer to the same thing?

  • Is it visible to users rather than purely internal/system‑generated? If it’s visible to users, then it’s a key the business are using already.

  • Can it stand alone without needing additional attributes to give it meaning? If it can’t stand alone then it probably isn’t a business entity.

System generated IDs can work if they are exposed to users and stable across systems, but most application IDs don’t survive system migrations or appear in multiple sources and that’s why I believe they’re discouraged as primary business keys.

On the “customer name” point…you’re right, manually entered names will always vary. I don’t think that’s a Data Vault problem. It’s a general data quality and matching problem. I often rely on IDs (customer number, account ID, email, etc.) to reduce ambiguity.

You’re also right that migrations can produce a mapping either internal or external to the system. That mapping allows you to maintain continuity using same as links. But a strong business key, when one exists, should remain unchanged across system migrations.

I can’t give you perfect examples for all the domains you listed, but for “transaction” it’s worth calling out Patrick Cuba’s view that a transaction isn’t a standalone business entity with its own business key, but a unit of work defined by the relationships between other business entities (customer, order, product, payment, etc.). That framing changed how I think about modelling them.

Modelling can feel subjective at times, and that’s normal. My advice is to build in small increments rather than trying to solve everything upfront. As long as you store data at its original grain, you always have what you need to correct assumptions and rebuild later if needed.

I’ll keep my reply concise this time :joy:

I liked the early BTPod episode about this. There are a couple more examples in there that might be helpful.

My main thoughts are the same as Dylan’s but expanded a bit for PII data: aim for something that wholly defines the entity uniquely, using tangible data if possible where that data is not sensitive data. Source system keys are often infinitely more identifiable than business keys so most projects default to them but they can make integrating new systems harder.

You’ve outlined the issue with customer name, spelling, overlaps, you name it. My biggest issue is that they’re identifiable information for a real person which often leads to heavier regulation. Names are messy. I prefer system identifiers for this since they are unique and clean but this has the drawbacks mentioned before.

1 Like