How to handle late-arriving data for effectivity sats

Hi folks,
I have a question that’s been keeping me awake at night. We receive CRM data via Kafka in a custom data model and currently load it in batches. The problem is that messages are sometimes delayed or lost, so information about a change arrives later than it should: classic late-arriving data.

For classic satellites, I can use XTS to correct the timeline. But how do I properly correct the effectivity sats? I searched for guidance on this but couldn’t find anything.

Thank you for your answers, guidance, and even pieces of pseudocode.

Heya Vit,

Welcome to the forum :grin:

Sorry for the paywalled link, but Patrick Cuba wrote a pretty good Medium article covering a lot of info on out-of-sequence loads. If you’ve got access, feel free to have a look through.

Since you’ve specifically requested pseudocode: I remember AutomateDV used to have a template for out-of-sequence sats that should be available in their repo somewhere. I think it’s being renovated or replaced, so you might need to search through some historic versions to dig it up.

I normally rely on the source effective timestamp to resolve these issues, and use the bitemporality of the vault to handle out-of-sequence events further down the pipeline, with some transformations if needed (usually not).

That said, this is not an area I get to tinker with or put into practice very often, so I’d love to hear what others have to say and iterate on best practices as a group!

All the best,
Frankie

Hi!

As Frankie says, the intent of the optional src_eff parameter on regular Satellites in AutomateDV is to support business-effectivity tracking for those records. This only applies to regular Sats, not Effectivity Sats.

If we take a step back and assume we aren’t using AutomateDV, it helps to look at this in a tool-agnostic way.

Late-arriving data forces us to reassess “what we knew, and when we knew it”. We can think of this as two timelines (or alternate universes, for fun):

Timeline 1: What we believed to be true before the late-arriving data
Timeline 2: What we later discovered to have been true once that data arrived

A few important things to take away from this:

  1. The business may have taken action based on Timeline 1, even though Timeline 2 tells us that information was incomplete or incorrect.
  2. Changes to business-effective timelines can have downstream knock-on effects.
  3. Rewriting history is generally ill-advised in the Raw Vault.

Starting with point 3: the Raw Vault must remain exactly that: raw. This is why regular Satellites in AutomateDV optionally support src_eff: it allows you to capture an additional business-effective timestamp alongside load time, so you can reason about multiple timelines later. It is not intended to overwrite or reorder history in place.

Late-arriving or out-of-sequence data should therefore be loaded as-is, preserving both load timestamps and business-effective timestamps. Effectivity Satellites allow you to record “when a relationship was discovered” versus “when it was actually valid”, without rewriting history.
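
As a rough illustration of "load as-is", here is a minimal Python sketch of an insert-only Effectivity Satellite load. It is not AutomateDV's implementation; the record layout and the names `EffSatRecord`, `link_hk`, `eff_from`, `load_ts`, `is_active` and `append_record` are all hypothetical, chosen only to show that a late record is appended with its own load timestamp and nothing already stored is touched.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class EffSatRecord:
    link_hk: str        # hash key of the relationship (hypothetical name)
    eff_from: datetime  # business-effective timestamp supplied by the source
    load_ts: datetime   # when the vault received the record
    is_active: bool     # relationship state as of eff_from (assumed flag)


def append_record(sat: List[EffSatRecord], rec: EffSatRecord) -> List[EffSatRecord]:
    """Insert-only load: existing rows are never updated or reordered."""
    if rec in sat:  # skip exact duplicates so reloads stay idempotent
        return sat
    return sat + [rec]


sat: List[EffSatRecord] = []
sat = append_record(sat, EffSatRecord("hk1", datetime(2024, 1, 1), datetime(2024, 1, 2), True))
sat = append_record(sat, EffSatRecord("hk1", datetime(2024, 3, 1), datetime(2024, 3, 2), False))

# Late arrival: business-effective in February, but only loaded on 10 March.
late = EffSatRecord("hk1", datetime(2024, 2, 1), datetime(2024, 3, 10), False)
sat = append_record(sat, late)
```

The late row ends up last in load order even though its `eff_from` falls between the other two, so both timestamps survive for later reconciliation.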

Any reconciliation between those timelines (for example deciding which version of reality to report on) belongs in the Business Vault or downstream presentation layers, not in the Raw Vault itself.

With Effectivity Satellites in particular, points 1 and 2 become more visible. For example, consider an Effectivity Satellite tracking an employee’s department assignment. HR may only discover after the fact that an employee moved departments, potentially after payroll has already closed.

That late discovery can impact cost allocation, reporting, and other downstream processes, even though the Raw Vault correctly reflects both what was known at the time and what was later discovered.
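
To make the department example concrete, here is a small Python sketch of how a downstream layer could answer "which department was this employee in on a given date?" for either timeline. The data, the `Assignment` layout and the `department_on` helper are invented for illustration, and "latest business-effective row wins" is just one possible rule, not a prescribed one.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Assignment:
    employee: str
    department: str
    eff_from: date   # when the assignment actually became valid (business time)
    load_date: date  # when the vault learned about it (load time)


# Append-only rows as they might sit in the Raw Vault (illustrative data only).
ROWS = [
    Assignment("E1", "Sales", date(2024, 1, 1), date(2024, 1, 2)),
    Assignment("E1", "Support", date(2024, 3, 1), date(2024, 3, 2)),
    # Late arrival: a move effective 1 Feb that was only loaded on 10 Mar.
    Assignment("E1", "Marketing", date(2024, 2, 1), date(2024, 3, 10)),
]


def department_on(rows, employee, business_date, known_at):
    """Department valid on business_date, using only rows loaded by known_at."""
    visible = [
        r for r in rows
        if r.employee == employee
        and r.load_date <= known_at
        and r.eff_from <= business_date
    ]
    if not visible:
        return None
    # Latest business-effective row wins; load_date breaks ties.
    return max(visible, key=lambda r: (r.eff_from, r.load_date)).department


# Timeline 1: what we believed about 15 Feb when payroll closed on 28 Feb.
believed = department_on(ROWS, "E1", date(2024, 2, 15), known_at=date(2024, 2, 28))
# Timeline 2: what we know about 15 Feb once the late record has arrived.
corrected = department_on(ROWS, "E1", date(2024, 2, 15), known_at=date(2024, 3, 31))
print(believed, corrected)  # Sales Marketing
```

Both answers come from the same untouched rows; only the `known_at` cut-off changes, which is what lets the business choose which timeline to report on.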

All of this is to say: track the right metadata in the Raw Vault so downstream business rules can be “late-arriving aware”. You don’t “fix” late-arriving data in the Raw Vault because there is nothing to fix. The reality is that there were always two timelines.

How those timelines are reconciled or presented to the business is ultimately a business decision, but the first step is capturing them separately and faithfully.

Once you’ve done that, you can have an informed conversation with the business about how they want to handle these scenarios, armed with both the knowledge and the data from your newly discovered multiverse.

I hope that helps you sleep a little more easily at night :slightly_smiling_face:
