Some things I learned using DBT

Created At: Oct 11, 2021 8:26 PM

Here are a few things I've learned using DBT: treating snapshots as a source of truth, separating modeling into layers that each have a single purpose, testing, and a handful of tricks.

Snapshots as a source of truth

Snapshot refresher

Code example:
{% snapshot h_statuses %}

    {{
        config(
            unique_key='id',
            strategy='timestamp',
            updated_at='updated_at',
        )
    }}

    select * from {{ source('app', 'status') }}

{% endsnapshot %}

Snapshots do what they sound like. However, what they capture is not the whole picture, but the differences. In the example below, the raw table tells us the current status is pending. Underneath it is the DBT snapshot of that raw data:


Raw data

| id | status | updated_at |
| --- | --- | --- |
| 1 | pending | 2021-01-01 |

DBT snapshot of the raw data

| id | status | updated_at | dbt_scd_id | dbt_updated_at | dbt_valid_from | dbt_valid_to |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | pending | 2021-01-01 | ... | 2021-01-01 | 2021-01-01 | null |

Now when that pending switches to sent, a new row is created in the snapshot table representing the change.


Raw data after status changed to sent

| id | status | updated_at |
| --- | --- | --- |
| 1 | sent | 2021-01-13 |

DBT snapshot of the raw data after the status change

| id | status | updated_at | dbt_scd_id | dbt_updated_at | dbt_valid_from | dbt_valid_to |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | pending | 2021-01-01 | ... | 2021-01-01 | 2021-01-01 | 2021-01-13 |
| 1 | sent | 2021-01-13 | ... | 2021-01-13 | 2021-01-13 | null |
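A useful consequence: the current state of the source table can always be recovered from the snapshot, because dbt leaves dbt_valid_to null on the live row. A minimal sketch, using the h_statuses snapshot from the example above:

```sql
-- Current state of each record: the row whose validity window is still open.
select
    id,
    status,
    updated_at
from {{ ref('h_statuses') }}
where dbt_valid_to is null
```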
  • Snapshotting your sources immediately improves quality
    • More analytics become possible, because the full history of changes is kept
    • More confidence in the numbers, because past states can be audited
  • Ways I like to make snapshots

Limitations and caveats of snapshots

  • refs can be a problem
    • If you have to use a ref in a snapshot, tag it as post-run so it executes after the models it depends on
  • Easily abused
    • It's easy to add changes, but difficult to remove them once they are in the snapshot table
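A sketch of the tagging approach: the tag name post-run and the snapshot/model names here are hypothetical, but the tags config and tag-based selection are standard dbt.

```sql
{% snapshot h_enriched_statuses %}

    {{
        config(
            unique_key='id',
            strategy='timestamp',
            updated_at='updated_at',
            tags=['post-run'],
        )
    }}

    -- This snapshot refs a model, so it is tagged to run in a
    -- separate pass after the regular `dbt run` has built stg_statuses.
    select * from {{ ref('stg_statuses') }}

{% endsnapshot %}
```

It can then be run on its own with `dbt snapshot --select tag:post-run`.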

Separate your modeling into layers and give each layer one purpose

Creating a raw source

  • Prefer no joins over joining
    • For some core models, joining here just makes more sense (e.g. lookup tables)
    • But prefer as few joins as possible
  • Think about separating out noisy columns into their own snapshots
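For instance, a column that churns constantly can get its own snapshot so it doesn't generate a new history row for every record. A sketch, assuming a hypothetical noisy last_seen_at column; the check strategy with check_cols is standard dbt and only records rows when the listed columns change:

```sql
{% snapshot h_status_last_seen %}

    {{
        config(
            unique_key='id',
            strategy='check',
            check_cols=['last_seen_at'],
        )
    }}

    -- Only the key plus the noisy column, so churn in last_seen_at
    -- doesn't inflate the main h_statuses snapshot.
    select id, last_seen_at from {{ source('app', 'status') }}

{% endsnapshot %}
```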

Filtering

  • For the performance benefit of all downstream models, it's best to have the filtering layer immediately after creating the snapshots
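A minimal sketch of that filtering layer; the model name and filter condition are hypothetical:

```sql
-- models/filtered/filtered_statuses.sql
-- Drop rows that no downstream model should ever see, once,
-- immediately after the snapshot.
select *
from {{ ref('h_statuses') }}
where status is not null  -- hypothetical data-quality filter
```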

Standardizing

  • Standardize more than just column names and data types
    • Standardize the organization of columns
  • Add standard calculations
  • Filling in nulls
    • I've had mixed results standardizing null values at this layer
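A sketch of a standardizing model; the model names, column order convention, and the specific calculation are assumptions, not from the original notes:

```sql
-- models/staging/stg_statuses.sql
select
    -- Standard column organization: keys first, then attributes, then timestamps.
    id as status_id,
    lower(status) as status,
    -- A standard calculation every downstream model gets for free.
    dbt_valid_to is null as is_current,
    updated_at,
    dbt_valid_from as valid_from,
    dbt_valid_to as valid_to
from {{ ref('filtered_statuses') }}
```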

Curating

  • This is the customization phase, so we are filtering and standardizing further for specific use cases
  • At this point we should have all possible changes to the data captured, filtered, and standardized
  • Marts
    • dim
    • fact
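A dim model at this layer might look like the following sketch, assuming the hypothetical standardized model above:

```sql
-- models/marts/dim_statuses.sql
-- Curated, current-state dimension built from the standardized layer.
select
    status_id,
    status,
    valid_from
from {{ ref('stg_statuses') }}
where is_current
```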

Testing

  • Really have a primary key test (unique and not null)
  • Separate tests into categories
    • Q/A
    • Anomaly detection
    • ETL-specific tests
  • Usually, the number of tests a model needs scales with the number of joins it performs
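The primary key check can be written as a dbt singular test, which passes when the query returns zero rows; the model name here is assumed:

```sql
-- tests/assert_stg_statuses_pk.sql
-- Fails if any status_id is null or duplicated.
select status_id
from {{ ref('stg_statuses') }}
group by status_id
having count(*) > 1 or status_id is null
```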

Tricks

  • Joining tables in a snapshot
    • Alias each table's timestamp on the way in, e.g. servicer_loan_portfolio_updated_at = updated_at
      • This is also how to make joins on timestamps fast
  • Create macros to help standardize processing the snapshots for downstream models
    • Generate base
    • Generate staging
  • Extracting only specific changes from a snapshot
    • Any change can be captured with a lag window function
  • Turn snapshots into daily, weekly, or monthly data
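The lag trick from the list above, as a sketch against the h_statuses snapshot from the earlier example:

```sql
-- Pair each snapshot row with the value it replaced; keeping only the rows
-- where the two differ isolates the status transitions.
with changes as (
    select
        id,
        status,
        lag(status) over (partition by id order by dbt_valid_from) as previous_status,
        dbt_valid_from as changed_at
    from {{ ref('h_statuses') }}
)
select *
from changes
where previous_status is null      -- first observed value
   or status != previous_status    -- an actual change in this column
```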