Ingesting JSON into the Raw Vault

Hi

Are there any patterns for ingesting JSON files into the raw vault? We have an upcoming use case where we will need to ingest medical questionnaires. We have developed our own framework for automating ingestion into DV, but so far all of our data sources have been structured data.

Thanks

I would say: identify the critical hub business keys as the business needs them, along with the core attributes of the hubs and links. Then:

  • Ingest the JSON as-is into a VARIANT (or similar) column. Given the potential complexity, it is better to flatten only the few core JSON fields at first; then, as business consumers start asking for more fields, it is easy to add BV views that flatten those JSON fields out of the raw SAT.
  • That process continues over time (new views for newly requested fields). Flattening JSON is fast in Snowflake and, more importantly, views are easy to update. Eventually everyone gains a good grasp of what a great materialization of the JSON data into hubs, SATs and links would look like.

So: ingest the JSON into a single VARIANT column → expand it with views → build views in the BV or Presentation (data products) layer → wait until consumers stabilize on their needs and have interacted with the views → then materialize the questionnaires into their own hubs, SATs, links, etc.
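To make the "flatten only a few core fields via a view" step concrete, here is a minimal Python sketch of what such a BV view does over the raw SAT. The field names (`patient_id`, `form`) and the row shape are hypothetical, purely for illustration; in Snowflake this would be a SQL view over the VARIANT column.

```python
import json

# Hypothetical raw SAT rows: the whole questionnaire JSON sits in a single
# "payload" field (the VARIANT column), alongside the usual DV columns.
raw_sat = [
    {"hash_key": "a1", "load_dts": "2024-01-01", "payload": json.dumps(
        {"patient_id": "P001", "form": "intake",
         "answers": {"q1": "yes", "q2": "no"}})},
]

def bv_view(rows):
    """Flatten only the few core fields consumers have asked for so far,
    the way a Business Vault view over the raw SAT would."""
    out = []
    for row in rows:
        doc = json.loads(row["payload"])
        out.append({
            "hash_key": row["hash_key"],
            "load_dts": row["load_dts"],
            "patient_id": doc.get("patient_id"),  # core field
            "form": doc.get("form"),              # core field
            # everything else stays in the raw payload until requested
        })
    return out
```

When a consumer asks for a new field, you extend the projection in the view; the raw payload underneath never changes, which is exactly why this iterate-then-materialize pattern is cheap.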

For example, it is very common and good practice to keep metadata about all DV actions in one single VARIANT column (source system, source table, ingestion process, app_id, bu_id, governance columns, etc.).
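A small sketch of what that single metadata column might carry, assuming illustrative key names and values (none of these are prescribed by the standard):

```python
import json

# Hypothetical DV process metadata kept together in one VARIANT-style column;
# the keys mirror the examples above (source system, ingestion process, ...).
dv_metadata = {
    "source_system": "questionnaire_app",
    "source_table": "responses",
    "ingestion_process": "load_raw_sat_v2",
    "app_id": "APP-42",
    "bu_id": "BU-7",
    "governance": {"pii": True, "retention_days": 3650},
}

# Serialized once, it rides along with the payload in its own column,
# so new metadata attributes never require a schema change.
metadata_col = json.dumps(dv_metadata)
```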


Hi Martin

Semi-structured data is always going to be the bane of any architecture, really. I think DV2.1 does more thinking in this space, but I’ll give my take from the more limited DV2.0 perspective.

As Emmanuel mentioned, all modern data platforms have options for storing semi-structured data directly in a column, so you can treat it as a payload column. If there are business keys in there, you can pull them out in staging and handle them separately. This is my least preferred method, mostly because these platforms are still trying to figure out how to handle these data types efficiently. Storing semi-structured data in an architected warehouse of any stripe feels like an anti-pattern.

The alternative that I do like, but which isn’t always applicable, is more of an organisational approach. Establish with your source what the schema of the incoming JSON is, and enforce this as a data contract: ensure that any data not matching the schema is rejected and reported as a source error.

If it matches the schema, then normalise and flatten the data on ingestion and treat the structured data like structured data.
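The validate-then-flatten gate above can be sketched in a few lines of Python. The contract shape, field names, and error reporting here are assumptions for illustration; a real implementation would likely use a schema language such as JSON Schema.

```python
import json

# Hypothetical data contract: required fields and their expected types.
CONTRACT = {"patient_id": str, "form": str, "answers": dict}

def validate(record: dict):
    """Return a list of contract violations (empty means the record passes)."""
    errors = []
    for field, expected_type in CONTRACT.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors

def ingest(raw_lines):
    """Split incoming JSON lines into accepted rows and rejected source errors."""
    accepted, rejected = [], []
    for line in raw_lines:
        record = json.loads(line)
        errors = validate(record)
        if errors:
            rejected.append({"record": record, "errors": errors})  # report back to source
        else:
            accepted.append(record)  # safe to flatten and load as structured data
    return accepted, rejected
```

The point of the gate is that everything downstream of `ingest` can be modelled as ordinary structured data, because anything off-contract has already been pushed back to the source as an error.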

If you can’t establish the shape of the data you’re being provided, then trying to model it in any way that favours optimization or efficiency is basically a non-starter.

Frankie
