Author: Patrick Cuba, A Data Vault 2.0 Expert, Snowflake Solution Architect, Snowflake
Data Vault on Snowflake: Point-in-Time (PIT) constructs & Join-Trees
Snowflake continues to set the standard for Data in the Cloud by taking away the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. You can expect the same relational capabilities for your data model as on any other platform, and Snowflake certainly delivers. Data Vault models are not built for consumption by Business Intelligence (BI) tools; they are built for automation and agility while guaranteeing auditability. Making changes to a Data Vault model does not destroy the existing model; rather, you augment it. To simplify and enhance querying a data vault model, we will discuss how and why you could consider building Point-in-Time (PIT) and Bridge tables.
- Immutable Store, Virtual End-Dates
- Snowsight dashboards for Data Vault
- Point-in-Time constructs & Join Trees
- Querying really BIG satellite tables
- Streams & Tasks on Views
- Conditional Multi-Table INSERT, and where to use it
- Row Access Policies + Multi-Tenancy
- Hub locking on Snowflake
- Virtual Warehouses & Charge Back
- Out-of-sequence data
- Handling Semi-Structured Data
A reminder of the data vault table types:
And now we will introduce two new table types to our series that are not strictly data vault tables, but are constructs designed to efficiently get data out of the data vault. These tables do not offer the agility, automation, and audit history we expect from data vault; they are disposable and designed to support information marts as views.
Episode 3: Point-in-Time (PIT) constructs & Join-Trees
We will extend the orchestration animation from the previous episode with a PIT table and supporting information marts.
More tables? Why are these necessary?
To capture all the necessary attributes of a business object or unit of work, satellite tables record the state of the business object or the relationship. However, a raw vault satellite table is source specific, a hub table can have multiple sources, and each source system automates business processes for which the data vault model stores the historical context. To join state data from multiple sources around a hub or link table, the user is forced to write complex logic to unify the data to serve the information mart layer. If this SQL logic is deployed as a view over the data vault, then the logic (although a repeatable pattern) will be executed at every query over that view. This is where a query assistance table such as a PIT is necessary, and it is necessary for two reasons:
- Simplifying the SQL query between the satellite tables by presenting the satellite table keys relevant to the point-in-time snapshot for the required reporting period.
- Making use of platform technology to achieve what is called a “star-optimised query” by informing the platform that the PIT table should be treated like a fact table, whilst the satellite tables are treated as dimension tables.
Let’s explore these in a little more detail…
Simplifying SQL queries
Key (pardon the pun) to understanding the implementation of a PIT table for querying a data vault is to compare it to how one would design facts and dimensions in a Kimball model. In Kimball modelling, if a record in a dimension table is not available for a fact record, a default dimension record is referenced instead (e.g., for a late-arriving record). The same design principle applies to a PIT table: each satellite table contains a ghost record, which carries no value and is not tied to any business object. Instead, the ghost record is used to complete an equi-join for a PIT table; in other words, if for a point-in-time snapshot the record in an adjacent satellite table does not yet exist, the PIT table will reference the ghost record for that business object until such time as a record does appear in that satellite for the business object. This is not to say that the record in the data vault is late arriving; rather, for that snapshot date there is no historical context for that business object yet. A PIT can be used in place of a hub table or a link table, depending on the data you need to unify.
If the ghost record were not in the design, a query over the data vault would rely on a mix of SQL inner, left, and right joins: simply too much complexity to solve a simple query! The same design thinking goes into facts and dimensions: querying facts and dimensions should rely on a date dimension, a central fact table, and other relevant dimensions, joining through a single join key between tables, resembling a star. Whereas in a Kimball model the date filter is applied on the date dimension only, in data vault the date filter is the snapshot date in the PIT table itself!
To illustrate how the above is implemented in a PIT table, let's pretend we have only one business object and add some records to two contributing satellites for that business key. Note that the first record in each satellite is the ghost record ('000'); it does not belong to any business object but is there to complete the equi-join for all business objects.
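As a sketch, seeding a ghost record into a satellite might look like the following. The table and column names (sat_card_balance, dv_hashkey_hub_account, and so on) are hypothetical, chosen only to illustrate the pattern:

```sql
-- Hypothetical raw-vault satellite; names are illustrative only
CREATE TABLE sat_card_balance (
    dv_hashkey_hub_account  BINARY(20),      -- surrogate hash key of the parent hub
    dv_loaddate             TIMESTAMP_NTZ,   -- load date timestamp
    dv_recsource            VARCHAR,         -- record source
    balance                 NUMBER(18,2)     -- descriptive payload
);

-- Ghost record: all-zero key, epoch-like load date, null payload.
-- It belongs to no business object; it exists only to complete the
-- equi-join from the PIT table when no real record exists yet.
INSERT INTO sat_card_balance
SELECT TO_BINARY(REPEAT('0', 40), 'HEX'),     -- '000...0' zero key
       '1900-01-01 00:00:00'::TIMESTAMP_NTZ,  -- predates any real load
       'GHOST',
       NULL;
```

The load date on the ghost record is deliberately older than any real load so it never competes with genuine history for a snapshot date.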
The PIT table acts like a hashmap, with reference points on where to find the appropriate record in each satellite table and which load dates are applicable for that snapshot date. Using a hash digest as a distribution key is very handy for a system that relies on one; Snowflake does not, so we can use something else instead.
Facts and dimensions use a surrogate sequence key, whereas the PIT construct illustrated above uses surrogate hash keys and load date timestamps. In fact, we can achieve the same outcome in data vault by defining an auto-increment column upon table creation for each satellite and using that column in the PIT table instead.
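For example, the satellite could be created with an auto-increment column from day one (a sketch; the names are hypothetical):

```sql
-- Add a surrogate sequence id to the satellite at creation time.
-- Because satellite rows are only ever appended, the id increases
-- with load order and can stand in for (hash key + load date)
-- when the PIT table is built.
CREATE TABLE sat_card_balance (
    dv_sid                 NUMBER IDENTITY,   -- auto-increment surrogate id
    dv_hashkey_hub_account BINARY(20),
    dv_loaddate            TIMESTAMP_NTZ,
    dv_recsource           VARCHAR,
    balance                NUMBER(18,2)
);
```

Snowflake accepts either IDENTITY or AUTOINCREMENT for this column property; note the generated values are guaranteed to be unique and increasing, but not necessarily gap-free.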
Let’s revisit the above illustration but use the surrogate sequence key instead.
A unique sequence key is the equivalent of combining hash keys with load dates; the advantage of this updated approach is that we use a single integer column instead, and therefore the joins will be faster when bringing large satellite tables into the join condition. The PIT table becomes a sequence-number-only point-in-time table, because the sequence ids themselves have temporal meaning. Even if a surrounding satellite table does not yet have a record for that business object, the same equi-join condition is used; example pseudo code:
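A minimal sketch of that equi-join, with hypothetical table and column names: each satellite is joined on a single integer column, and the snapshot date filter is applied to the PIT table itself rather than to a date dimension.

```sql
SELECT hub.account_id,
       pit.snapshotdate,
       bal.balance,
       sts.status
  FROM pit_account      pit
  JOIN hub_account      hub ON hub.dv_hashkey_hub_account = pit.dv_hashkey_hub_account
  -- single-integer equi-joins; when a satellite has no record yet,
  -- the PIT column simply holds the ghost record's sequence id
  JOIN sat_card_balance bal ON bal.dv_sid = pit.sat_card_balance_sid
  JOIN sat_card_status  sts ON sts.dv_sid = pit.sat_card_status_sid
 WHERE pit.snapshotdate BETWEEN '2022-01-01' AND '2022-03-31';
```

Because every join is an equi-join against the central PIT table, the query shape is exactly what a star-optimised plan expects.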
Next up, how Snowflake can optimise your PIT queries.
Star optimised query
Because metadata statistics on all Snowflake objects are always up to date, Snowflake's query optimiser will immediately recognise the central PIT table in a multi-table join as the largest table. When such a query is recognised, the optimiser will employ an SQL join strategy called a right-deep join tree; here's how it works:
- A left-deep join tree employs a nested loop to join tables; the result of the first join is used as a base table to probe the subsequent table, producing an internal interim table until all join conditions are satisfied. For a mix of join types and varying table sizes, and without a central table to act as the centrepiece of the join query, this type of join may be more optimal, see bit.ly/3rQLz57
- A right-deep join tree employs a hash join to join tables. Internally, hash tables are created in a build phase and used to probe the large central table for matches. This join requires an equi-join predicate, see bit.ly/3EMwoPS.
- A bushy join tree contains a mix of the above
Join strategies require that metadata statistics are up to date for the optimiser to decide on the best strategy, and since Snowflake's object statistics are always up to date, you can feel confident Snowflake will pick the right join strategy for your query. You can identify that a right-deep join tree has occurred by observing the shape of your query plan in the query profiler in Snowsight.
For small volumes the join plan might not matter, but on much larger queries the performance gain can be quite significant. Here's how:
With the above table statistics, below are the results produced when joining the four satellite tables to deploy a virtual information mart.
- Without a PIT table, the query run time was nearly eight and a half minutes! Note the volume of data scanned over the network!
- A PIT table with hash keys and load dates reduced the query time down to 20 seconds with far less network traffic, a saving of over 8 minutes!
- A PIT table with sequence ids halved the above byte numbers and shaved off a further 5 seconds in query time!
All the above tests were conducted with the result cache turned off, on an XSMALL virtual warehouse that was flushed between test runs. The results are consistent.
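If you want to reproduce this kind of test yourself, the result cache can be disabled per session, and a warehouse's local cache can be flushed by suspending and resuming it (the warehouse name below is a placeholder):

```sql
-- Disable the result cache for this session so repeated runs
-- actually re-execute the query
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- Suspend and resume the warehouse between runs to flush its local cache
ALTER WAREHOUSE my_wh SUSPEND;
ALTER WAREHOUSE my_wh RESUME;
```

Without these steps, a second run of the same query can return from cache in milliseconds and any timing comparison is meaningless.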
Part 1 of 2 on query optimisation
Recognising that Snowflake has an OLAP query engine is the first step to understanding how to design efficient data models and identifying when query assistance tables can be used to improve the performance you get out of your Data Vault on Snowflake.
In the next episode we will discuss another technique that will greatly improve query times for very large satellite tables, those with a sharp eye may have seen this technique already being used in the above star optimised query! Until next time!
- Snowflake Query Profiler — https://docs.snowflake.com/en/user-guide/ui-query-profile.html
- Snowflake SQL JOIN — https://docs.snowflake.com/en/sql-reference/constructs/join.html
- Auto-increment / identity column — https://docs.snowflake.com/en/sql-reference/sql/create-table.html
- Zero Keys and Ghost Records — https://patrickcuba.medium.com/data-vault-mysteries-zero-keys-ghost-records-3be7cb05ee94
- Originally published here: https://www.snowflake.com/blog/point-in-time-constructs-and-join-trees/
The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.
PS: the use of sequence ids in the satellite tables and their selection in PIT tables is my own variation on the regular PIT structure to improve join performance on Snowflake.