Previous Post
Created At
Oct 11, 2021 8:26 PM
TODO: intro
Snapshots as a source of truth
Snapshot refresher
‣
Snapshots do what they sound like. However, what they are capturing is not the whole picture, but the differences. In the example below we have the raw table telling us the current status is pending on the left. On the right is the DBT snapshot of that raw data:
Raw data
id | status | updated_at |
---|---|---|
pending | 2021-01-01 |
DBT snapshot of the raw data
id | status | updated_at | dbt_scd_id | dbt_updated_at | dbt_valid_from | dbt_valid_to |
---|---|---|---|---|---|---|
pending | 2021-01-01 | ... | 2021-01-01 | 2021-01-01 |
Now when that pending switches to sent a new row are created in the snapshot table representing the change.
Raw data after status changed to sent
id | status | updated_at |
---|---|---|
sent | 2021-01-13 |
DBT snapshot of the raw data after the status change
- Immediately improves the quality
- More analytics
- More confidence
- Ways I like to make snapshots
Limitations and caveats of snapshots
- refs can be a problem
- If you have to use a ref in a snapshot tag it as post-run
- Easily abused
- Easy to add changes, difficult to remove them once they are there
Separate your modeling into layers and give each layer one purpose
Creating a raw source
- Prefer no joins over joining
- For some core models joining here just makes more sense
- lookup tables
- but prefer as few joins as possible
- Think about separating out noisy columns to their own snapshots
Filtering
- For the performance benefit of all downstream models, it's best to have the filtering layer immediately after creating the snapshots
Standardizing
- Standardize more than just column names and data types
- Standardize the organization of columns
- Add standard calculations
‣
Curating
- At the customization phase so we are filtering, and standardizing further
- At this point we should have all possible changes to the data, filtered, and standardized
- Marts
- dim
- fact
Testing
- Really have a primary key test
- Separate tests into categories
- Q/A
- Anomaly Detection
- ETL specific tests
- Usually, the number of tests a model has will scale with the number of joins
Tricks
- Joining tables in a snapshot
- servicer_loan_portfolio_updated_at = updated_at
- How to make joins fast for timestamps
- Create macros to help standardize processing the snapshots for downstream models
- Generate base
- Generate staging
- Only specific changes from a snapshot
- Any change can be captured with a lag
- Turn snapshots into daily, weekly, or monthly data