areto consulting blog

Data Vault on Snowflake: Point-in-Time (PIT) constructs & Join-Trees

Data Vault on Snowflake Point in Time constructs and Join Trees

Autor: Patrick Cuba, A Data Vault 2.0 Expert, Snowflake Solution Architect, Snowflake

Data Vault on Snowflake: Point-in-Time (PIT) constructs & Join-Trees

Snowflake continues to set the standard for Data in the Cloud by taking away the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. You will expect the same relational capabilities for your data model as any other platform and Snowflake certainly delivers. Data Vault models are not built for consumption by Business Intelligence (BI) tools, they are built for automation and agility while guaranteeing auditability; making changes to a Data Vault model does not destroy the existing model, rather you augment it. To simplify and enhance querying a data vault model we will discuss the how and why you could consider building Point-in-Time (PIT) and Bridge tables.

Episode catalogue,

  1. Snowsight dashboards for Data Vault
  2. Point-in-Time constructs & Join Trees
  3. Querying really BIG satellite tables
  4. Streams & Tasks on Views
  5. Conditional Multi-Table INSERT, and where to use it
  6. Row Access Policies + Multi-Tenancy
  7. Hub locking on Snowflake
  8. Virtual Warehouses & Charge Back
  9. Out-of-sequence data
  10. Handling Semi-Structured Data

A reminder of the data vault table types,

And now we will introduce two new table types to our series that are not necessary data vault tables, but are constructs designed to efficiently get the data out of data vault. These tables do not offer the agilityautomation, and audit history we expect from data vault, these tables are disposable and designed to support information marts as views.

Episode 3: Point-in-Time (PIT) constructs & Join-Trees

We will extend the orchestration animation we extended in the previous episode with a PIT table and supported information marts

Data Vault Automation and Orchestration

More tables? Why are these necessary?

Pursuant of all the necessary attributes of a business object or unit of work, satellite tables capture the business object’s state or the relationship’s state. However, a raw vault satellite table is source specific, and a hub table can have multiple sources and each source system is automating business processes to which the data vault model is storing the historical context. To join state data from multiple sources around the hub or link table the user is forced to write complex logic to unify the data to serve the information mart layer. If this SQL logic is deployed once over the data vault, then the logic will be (although a repeatable pattern) executed at every query over that view. This is where a query assistance table such as a PIT is necessary, and it is necessary for two reasons,

  • Making use of platform technology to achieve what is called a “star-optimised query” by informing the platform that the PIT table should be treated like a fact table, whilst the satellite tables are treated as dimension tables.

Let’s explore these in a little more detail…

Simplifying SQL queries

Data vault model with a query assistance table

Key (pardon the pun) to understanding the implementation of a PIT table for querying a data vault is by comparing the implementation to how one would design facts and dimensions in a Kimball model. In Kimball modelling If a record in a dimension table is not available for a fact record a default dimension record is referenced instead (ex. referencing to a late arriving record). The same design principle is intended with a PIT table, each satellite table contains a ghost record, the ghost record has no value and is not tied to any business object or anything else of value. Instead, the ghost record is used to complete an equi-join for a PIT table, in other words, if for a point-in-time snapshot the record in an adjacent satellite table does not yet exist the PIT table will reference that ghost record for that business object until such time a record does start to appear for that satellite for the business object. This is not to say that the record in data vault is late arriving, rather for that snapshot date there is no historical context for that business object yet. A PIT can be used in place of a hub table or a link table depending on the data you are needing to unify.

If the ghost record was not in the design, a query over the data vault would rely on a mix of SQL inner, left, and right joins, simply too much complexity to solve a simple query! The same design does go into facts and dimensions, querying facts and dimensions should rely on a date dimension, a central fact table and other relevant dimensions joining through a single join key between tables resembling a star. Whereas in a Kimball model the date filter is applied on the date dimension only, the date filter in data vault is the snapshot date in the PIT table itself!

To illustrate how the above is implemented in a PIT table, let’s pretend we only have one business object and accordingly, add some records to two contributing satellites for that business key. Note that the first record in the satellite is the ghost record (‘000’), it does not belong to any business object but is there to complete the equi-join for all business objects.

Snapshot of hashkeys and load dates

The PIT table acts like a hashmap with reference points on where to find the appropriate record in each satellite table and which load dates are applicable for that snapshot date. Using a hash digest as a distribution key is very handy for a system that relies on it; Snowflake does not so we can use something else instead.

Facts and dimensions use a surrogate sequence key instead of what we have done with the PIT construct illustrated above, in that we used surrogate hash keys and load date timestamps. But in fact, we can achieve the same outcome in data vault by defining an auto-increment column upon table creation for each satellite and use that temporal column in the PIT table instead.

Let’s revisit the above illustration but use the surrogate sequence key instead.

Snapshot of sequence ids

A unique sequence key will be the equivalent of combining hash keys with load dates, the advantage of this updated approach is that we are using a single integer column instead and therefore the joins will be faster when bringing together large satellite tables into the join condition. The PIT table construct becomes a sequence number-only point-in-time table because the sequence ids themselves have temporal meaning. No matter if a surrounding satellite table does not yet have a record for that business object, the same equi-join condition is used, example pseudo code:

Next up, how Snowflake can optimise your PIT queries.

Star optimised query

Because metadata statistics on all Snowflake objects are always up to date, Snowflake’s query optimiser will recognise immediately that the central PIT table in a multi-table join as the largest table. When such a query is recognised, the optimiser will employ an SQL join strategy called Right-Deep Join Tree, here’s how it works:

1) Left Deep, 2) Right Deep, 3) Bushy Tree
  1. Right deep join tree employs a hash join to join tables. Internally hash tables are created in a build phase that is used to probe the large central table for matches. This join requires an equi-join predicate, see
  2. Bushy tree contains a mix of the above

Join strategies require that the metadata statistics are up to date for the optimiser to decide what is the best strategy and since Snowflake’s object statistics are always up to date you can feel confident Snowflake will pick the right join strategy for your query. The way you identify that a right deep join tree has occurred is by observing the shape of your query plan in the query profiler in Snowsight.

Right-Deep Join Tree

For small volumes the join plan might not matter, but on much larger queries the query performance gain can be quite significant. Here’s how,

With the above table statistics, the below are the statistics produced when joining the four satellite tables to deploy a virtual information mart.

1) Without a PIT, 2) With a PIT using HashKeys & LoadDates, 3) With a PIT using Seq-ids
  1. A PIT table with hash keys and load dates reduced the query time down to 20 seconds with far less network traffic, a saving of over 8 minutes!
  2. A PIT table with sequence ids halved the above byte numbers and shaved off a further 5 seconds in query time!

All the above tests were conducted with RESULTCACHE turned off, an XSMALL virtual warehouse which was flushed between run tests. The results are consistent.

Part 1 of 2 on query optimisation

Recognising that Snowflake has an OLAP query engine is the first step to understanding how to design efficient data models and identifying when query assistance tables can be used to improve the performance you get out of your Data Vault on Snowflake.

In the next episode we will discuss another technique that will greatly improve query times for very large satellite tables, those with a sharp eye may have seen this technique already being used in the above star optimised query! Until next time!


The views expressed in this article are that of my own, you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.

PS: the use of sequence-ids in the satellite tables and its selection in PIT tables is my own variation of the regular PIT structure to improve join performance on Snowflake.

Blog-Beitrag teilen

areto Marketing

+49 221 66 95 75-0