Link and the inlcusion or exclusion of a BK

Hi,

Quick question:

If I have a file/table that will be a link and I have 3 BKs, which will be hashed, that is a given, and these 3 BKs will provide the grain for unique record counts. If, however, I have other known BKs (to be hashed, of course) within the file/table, but have no effect on the grain, or contain descriptive fields within this file/table, should they not be included in the Link itself? If not, do they go into the satellite, hence they are in the role of an “attribute”?

1 Like

Hi John,

Big fan of your books (obligatory, I’m sorry)

I think that links should contain only what is needed to define them. If you have another source system in the future that has this relationship but not your extra business keys is that going to cause you a problem?

Anything unnecessary for defining the relationship is an attribute of the relationship in my eyes.

That’s my initial (helpful) answer but the perspective of this question raises some small red flags. You’re looking at creating vault objects from a file/table: source-driven. When these decisions should be based off what the relationships and entities themselves are used for: business-driven.

I’m sure you’ve got the big picture stuff under control so I’ll keep my nose out :sweat_smile:

Feel free to ask follow up Q’s I’m up for a back and forth and I’m sure many people have more insight to give.

Frankie

Ha, while I haven’t written any books, I was mentioned in the Acknowledgment section of a book, Data Governance Needs Risk Management, written by Mark Atkins and Terry Smith. They have a tool, Intralign, their company is Intraversed, and they teach you how to define business terms, i.e., classification, purpose, function, and characteristics, and other “rule types!” It’s great!!!

So, I’ve developed my “ontology” of business concepts, i.e., say, account, portfolio, asset, client, participant, agent, and a few key other core business elements, hence Hubs.

In my table/file, from my understanding, we have the grain of BK columns in the file, i.e., Unit of Work, and then the grain of the unique row count. Thus, if account, portfolio, and asset give me the unique row count grain, we are saying that if the file contains the grain of other BK columns, they should NOT be in the link?

That means, then, if we put those in a satellite, and we now need those BKs in an information mart, we’d need to create a link in the BV/Bridge. In my head, to do that, I would need to query the said satellite in the landing/stage, to create the hash, coordinate it as a job to insert into the hub(s), then go back to my BV, create a query to create a new link, and hash said BKs. Then, for the IM, could I properly rejoin the Hub to get descriptive data from the Hub satellite? Correct? Aside from that being a lot of future technical debt, the rule, if I’m correct, is that you can’t join a BK in a satellite back to its hub and join on the BKs, as this violates data vault rules.

Hence, I’m confused on how BKs in a file/table, that don’t participate in the row grain count for uniqueness, but are BK column(s) grain in the file/table, don’t belong in the link???

For “I think that links should contain only what is needed to define them. If you have another source system in the future that has this relationship but not your extra business keys is that going to cause you a problem?” That is what the BV is for, correct?

Thank you,

John

The K.I.S.S. is “if bk contributes to link grain ?”:

  • yes → add bk or dck to link.
  • otherwise →add as link sat attributes.

Additionally, if that extra bk(s) may indeed be part of defining grain on s other unit of work,mthen it’s a new link.

Hi Emanueol,

Can you help me define grain?

For granularity, the book states, “The granularity of links is defined by the number of hubs that they connect,” (pg.). 106, para. 5 (Section 4.4.3.).

For UoW, the book states, “The unit-of-work is a correlated set of data that keeps key sets together,” pg. 109, para. 3 (Section 4.4.4.), (see also training manual page 73 or 92, depending on training manual version.)

For splitting of links, the book states, “In some cases, data warehouse modelers try to split a link into multiple, smaller links in order to normalize the link information for modeling purposes,” pg. 109, para. 3 (Section 4.4.4.). And goes on to say that when rejoining, the one creates a new record that never existed in the source, and “this problem is also known as multivalued dependencies,” pg. 110, para. 2.

Thus, throughout the chapter, I see no textual signals to indicate otherwise, that if a BK is in a file/table, then “some” BK(s) may or may not be included, as the grain is defined by what BKs are in the file/table.

Thank you for responding.

John

This has been a question since data vault was invented! The BK would actually hash to a Sid that was a hub not a satellite. But, having the BK on the link in my experience is so useful for debugging purposes that despite the religious standpoint, just include it!

Hi TimNapier

Ahhh, I think we might be talking about two slightly different things. If I understand you correctly, you’re saying that if I already have the three hashed BKs in the link (from their respective hubs), you would still include the actual BK values in the link as well, purely for debugging and traceability.

My original question was slightly different. If a BK (like an Agent) appears in the source file/table but isn’t necessary to define the initial row grain (for example, account, portfolio, and asset determine the distinct row), yet still represents a real business participant in that relationship, wouldn’t that typically mean it should participate in the link grain (i.e., its hashed BK included in the link) rather than the Agent BK column/value being pushed into a satellite as an attribute?

I’m having trouble understanding how a participant that exists in the source record would not represent part of the business relationship (i.e., the UoW), although I realize others may see it as contextual rather than relational.

Thus, I’m basing my understanding on the definitions in the book and training material, which—at least as I interpret them—do not suggest that a BK present in the source should be excluded from the grain or unit of work of the link (hashed).

For example, the book states that “the granularity of links is defined by the number of hubs that they connect” (p.106, para. 5, Section 4.4.3). It also defines the UoW as “a correlated set of data that keeps key sets together” (p. 109, para. 3, Section 4.4.4; see also the training manual, p. 73 or p. 92, depending on the version).

The text further warns against splitting a link into smaller links solely for normalization purposes, noting that “data warehouse modelers try to split a link into multiple, smaller links” and that doing so can create “multivalued dependencies” (p.109–110, Section 4.4.4). It also states that “this problem is also known as multivalued dependencies,” pg. 110, para. 2.

Based on those definitions, my interpretation is that if a business participant (such as Agent) appears in the source record and participates in the correlated set of keys, it should belong in the link grain of BKs (hashed) rather than as a descriptive attribute in a satellite.

Thanks for your input!

John

Include in Link grain → when the BK defines/participates in the relationship itself (e.g., “this contract links Customer X + Product Y + Agent Z”). Hash it into the Link PK → ensures uniqueness and auditability.

Exclude from Link grain → when it’s contextual/descriptive (e.g., “notes about the agent” or attributes that can vary without breaking the core relationship). Place in Satellite on the Link → allows history tracking without inflating Link cardinality.

Last but not least, important to back design from business perspective vs trying create data vault based on source files which seems is you struggle.. avoid that temptation as easier to try map something physical than engage understand business domain ontology (what are hubs ? is the starting and most important point looks trivial right? just a table with PKs, but this can only be done properly after understand the business from.. business side, ignore whatever fields you receiving on files. Engage from right (consumer side), and start sith core fields.. resist temptation mapping every single field arriving on files.

Start with identifying hubs (is agent an hub or “note” attribute? if its an hub, is it single business classe or is there a subclasses of an entity..

The hard part of data vault isn’t the technical ddl of tables, but understand the business entities and relationships.

I’m leaning toward agreeing with Emanueol on this one (seems to be happening a lot lately in this forum!).

If a field contributes to the grain of the data in your source, then it can’t be ignored. When a field is defined as a business entity, it should be included in the link key and represented as its own foreign key within the link.

If the field contributes to the grain but isn’t identified as a business entity, then it should be treated as a dependent key. There have been plenty of discussions here around handling dependent keys - for example: Modeling a dependent child - Data Vault 2.0 - Data Community Forum

If the field falls into neither category, it’s likely descriptive and should live in your satellites.

Hi Dylan.roe and emanueol,

I think I’m seeing the disconnect. I’ll be more explicit.

I have a file/table that contains different attribute types. Specifically, there are five “BK” attributes: Account_ID, Portfolio_ID, Asset_ID, Agent_ID, and Beneficiary_ID. Additionally, there are “descriptive" attributes as well.

The “BK” attributes that make the record count grain unique (i.e., HAVING COUNT(*) > 1) include Account_ID, Portfolio_ID, and Asset_ID. The remaining two “BK” attributes, Agent_ID and Beneficiary_ID, along with the three previously mentioned “BK” attributes, are, for simple terms, the “BK” grain of the file/table.

Many argue that since the “BK” attributes Agent_ID and Beneficiary_ID are not involved in the record count uniqueness (i.e., HAVING COUNT(*) > 1), they should not be included as “hashed BKs” in the link. Instead, they ought to be stored as “descriptive” attributes in the link satellite. Too, I don’t see these two “BK” attributes as Dependent Child Keys either.

Does that make sense?

Thank you for your response.

John

Surely there must exist several types of relationships between all or some of those 5 entities (they all seem proper entities hubs).

Create as many links as relationships, each having the maximum number of hubs defining proper hubs.

1 Like

I’d be cautious about creating multiple links from a single source. When you later try to rebuild that source from several linked records, you risk introducing relationships that were never actually there.

If you really need a link at a reduced grain, it can always be created after the Raw Vault.