Skip to main content

When to use this

Duplicates creep in from many sources — an API that returns overlapping pages, a webhook that fires twice, or a full-refresh sync that re-ingests historical records. Before your data reaches the trusted layer, you need to pick the right deduplication strategy: exact duplicates (identical rows), or near-duplicates (same business key but different timestamps or values, where you want to keep only the latest).

Sample input

A contacts table in the Raw layer with duplicate records from overlapping syncs:
contact_idnameemailupdated_at
101Alice Johnsonalice@acme.com2024-03-15 10:00:00
101Alice Johnsonalice@acme.com2024-03-16 14:30:00
102Bob Smithbob@globex.com2024-03-15 09:00:00
103Carol Leecarol@initech.com2024-03-14 08:00:00
103Carol Leecarol.lee@initech.com2024-03-17 11:00:00
Contact 101 appears twice with the same data but different timestamps. Contact 103 appears twice with an updated email. In both cases, we want to keep only the most recent row per contact_id.

Implementation

The same ROW_NUMBER() approach works in BigQuery. You can also use QUALIFY for a more concise syntax.
SELECT
  contact_id,
  name,
  email,
  updated_at
FROM `raw.contacts`
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY contact_id
  ORDER BY updated_at DESC
) = 1
QUALIFY filters the result of a window function directly, eliminating the need for a CTE or subquery. It’s a BigQuery extension that makes dedup queries much cleaner.

Expected output

contact_idnameemailupdated_at
101Alice Johnsonalice@acme.com2024-03-16 14:30:00
102Bob Smithbob@globex.com2024-03-15 09:00:00
103Carol Leecarol.lee@initech.com2024-03-17 11:00:00
Only the most recent row per contact_id is retained.

Tips and gotchas

Make sure the column you use for ordering (updated_at, _nekt_sync_at, etc.) is reliably populated. If some rows have NULL timestamps, they may sort unexpectedly. Add NULLS LAST in SQL or use F.col("updated_at").desc_nulls_last() in PySpark to push NULLs to the bottom.
When using _nekt_sync_at (the Nekt ingestion timestamp) for dedup ordering, keep in mind it reflects when the data was synced, not when it was updated at the source. Prefer a source-provided updated_at or modified_date column when available for more accurate deduplication.