Some things I learned using DBT

Created At: Oct 11, 2021 8:26 PM

Here are a few things I've learned using DBT: treating snapshots as a source of truth, separating modeling into layers that each have a single purpose, testing, and a handful of tricks.

Snapshots as a source of truth

Snapshot refresher

Code example:
{% snapshot h_statuses %}

    {{
        config(
            unique_key='id',
            strategy='timestamp',
            updated_at='updated_at',
        )
    }}

    select * from {{ source('app', 'status') }}

{% endsnapshot %}

Snapshots do what they sound like. However, what they capture is not the whole picture, but the differences. In the example below, the raw table tells us the current status is pending. Underneath it is the DBT snapshot of that raw data:


Raw data

| id | status | updated_at |
| --- | --- | --- |
| 1 | pending | 2021-01-01 |

DBT snapshot of the raw data

| id | status | updated_at | dbt_scd_id | dbt_updated_at | dbt_valid_from | dbt_valid_to |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | pending | 2021-01-01 | ... | 2021-01-01 | 2021-01-01 | null |

Now when that pending switches to sent, a new row is created in the snapshot table representing the change.


Raw data after status changed to sent

| id | status | updated_at |
| --- | --- | --- |
| 1 | sent | 2021-01-13 |

DBT snapshot of the raw data after the status change

| id | status | updated_at | dbt_scd_id | dbt_updated_at | dbt_valid_from | dbt_valid_to |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | pending | 2021-01-01 | ... | 2021-01-01 | 2021-01-01 | 2021-01-13 |
| 1 | sent | 2021-01-13 | ... | 2021-01-13 | 2021-01-13 | null |
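A useful consequence: the current state of the source table can always be recovered from the snapshot, because dbt leaves dbt_valid_to null on the live row. A minimal sketch, using the h_statuses snapshot from the example above:

```sql
-- Current state of each record: the row whose validity window is still open.
select
    id,
    status,
    updated_at
from {{ ref('h_statuses') }}
where dbt_valid_to is null
```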
  • Snapshotting your sources immediately improves quality
    • More analytics become possible, because the full history of changes is kept
    • More confidence in the numbers, because past states can be audited
  • Ways I like to make snapshots

Limitations and caveats of snapshots

  • refs can be a problem
    • If you have to use a ref in a snapshot, tag it as post-run so it executes after the models it depends on
  • Easily abused
    • It's easy to add changes, but difficult to remove them once they are in the snapshot table
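A sketch of the tagging approach: the tag name post-run and the snapshot/model names here are hypothetical, but the tags config and tag-based selection are standard dbt.

```sql
{% snapshot h_enriched_statuses %}

    {{
        config(
            unique_key='id',
            strategy='timestamp',
            updated_at='updated_at',
            tags=['post-run'],
        )
    }}

    -- This snapshot refs a model, so it is tagged to run in a
    -- separate pass after the regular `dbt run` has built stg_statuses.
    select * from {{ ref('stg_statuses') }}

{% endsnapshot %}
```

It can then be run on its own with `dbt snapshot --select tag:post-run`.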

Separate your modeling into layers and give each layer one purpose

Creating a raw source

  • Prefer no joins over joining
    • For some core models, joining here just makes more sense (e.g. lookup tables)
    • But prefer as few joins as possible
  • Think about separating out noisy columns into their own snapshots
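For instance, a column that churns constantly can get its own snapshot so it doesn't generate a new history row for every record. A sketch, assuming a hypothetical noisy last_seen_at column; the check strategy with check_cols is standard dbt and only records rows when the listed columns change:

```sql
{% snapshot h_status_last_seen %}

    {{
        config(
            unique_key='id',
            strategy='check',
            check_cols=['last_seen_at'],
        )
    }}

    -- Only the key plus the noisy column, so churn in last_seen_at
    -- doesn't inflate the main h_statuses snapshot.
    select id, last_seen_at from {{ source('app', 'status') }}

{% endsnapshot %}
```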

Filtering

  • For the performance benefit of all downstream models, it's best to have the filtering layer immediately after creating the snapshots
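A minimal sketch of that filtering layer; the model name and filter condition are hypothetical:

```sql
-- models/filtered/filtered_statuses.sql
-- Drop rows that no downstream model should ever see, once,
-- immediately after the snapshot.
select *
from {{ ref('h_statuses') }}
where status is not null  -- hypothetical data-quality filter
```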

Standardizing

  • Standardize more than just column names and data types
    • Standardize the organization of columns
  • Add standard calculations
  • Filling in nulls
    • I've had mixed results standardizing null values at this layer
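A sketch of a standardizing model; the model names, column order convention, and the specific calculation are assumptions, not from the original notes:

```sql
-- models/staging/stg_statuses.sql
select
    -- Standard column organization: keys first, then attributes, then timestamps.
    id as status_id,
    lower(status) as status,
    -- A standard calculation every downstream model gets for free.
    dbt_valid_to is null as is_current,
    updated_at,
    dbt_valid_from as valid_from,
    dbt_valid_to as valid_to
from {{ ref('filtered_statuses') }}
```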

Curating

  • This is the customization phase, so we are filtering and standardizing further for specific use cases
  • At this point we should have all possible changes to the data captured, filtered, and standardized
  • Marts
    • dim
    • fact
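A dim model at this layer might look like the following sketch, assuming the hypothetical standardized model above:

```sql
-- models/marts/dim_statuses.sql
-- Curated, current-state dimension built from the standardized layer.
select
    status_id,
    status,
    valid_from
from {{ ref('stg_statuses') }}
where is_current
```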

Testing

  • Really have a primary key test (unique and not null)
  • Separate tests into categories
    • Q/A
    • Anomaly detection
    • ETL-specific tests
  • Usually, the number of tests a model needs scales with the number of joins it performs
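The primary key check can be written as a dbt singular test, which passes when the query returns zero rows; the model name here is assumed:

```sql
-- tests/assert_stg_statuses_pk.sql
-- Fails if any status_id is null or duplicated.
select status_id
from {{ ref('stg_statuses') }}
group by status_id
having count(*) > 1 or status_id is null
```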

Tricks

  • Joining tables in a snapshot
    • Alias each table's timestamp on the way in, e.g. servicer_loan_portfolio_updated_at = updated_at
      • This is also how to make joins on timestamps fast
  • Create macros to help standardize processing the snapshots for downstream models
    • Generate base
    • Generate staging
  • Extracting only specific changes from a snapshot
    • Any change can be captured with a lag window function
  • Turn snapshots into daily, weekly, or monthly data
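The lag trick from the list above, as a sketch against the h_statuses snapshot from the earlier example:

```sql
-- Pair each snapshot row with the value it replaced; keeping only the rows
-- where the two differ isolates the status transitions.
with changes as (
    select
        id,
        status,
        lag(status) over (partition by id order by dbt_valid_from) as previous_status,
        dbt_valid_from as changed_at
    from {{ ref('h_statuses') }}
)
select *
from changes
where previous_status is null      -- first observed value
   or status != previous_status    -- an actual change in this column
```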