When to use this
Duplicates creep in from many sources — an API that returns overlapping pages, a webhook that fires twice, or a full-refresh sync that re-ingests historical records. Before your data reaches the trusted layer, you need to pick the right deduplication strategy: exact duplicates (identical rows), or near-duplicates (same business key but different timestamps or values, where you want to keep only the latest).Sample input
Acontacts table in the Raw layer with duplicate records from overlapping syncs:
| contact_id | name | updated_at | |
|---|---|---|---|
| 101 | Alice Johnson | alice@acme.com | 2024-03-15 10:00:00 |
| 101 | Alice Johnson | alice@acme.com | 2024-03-16 14:30:00 |
| 102 | Bob Smith | bob@globex.com | 2024-03-15 09:00:00 |
| 103 | Carol Lee | carol@initech.com | 2024-03-14 08:00:00 |
| 103 | Carol Lee | carol.lee@initech.com | 2024-03-17 11:00:00 |
101 appears twice with the same data but different timestamps. Contact 103 appears twice with an updated email. In both cases, we want to keep only the most recent row per contact_id.
Implementation
- Nekt Express / BigQuery
- Athena SQL
- Python (Nekt SDK)
The same
ROW_NUMBER() approach works in BigQuery. You can also use QUALIFY for a more concise syntax.Expected output
| contact_id | name | updated_at | |
|---|---|---|---|
| 101 | Alice Johnson | alice@acme.com | 2024-03-16 14:30:00 |
| 102 | Bob Smith | bob@globex.com | 2024-03-15 09:00:00 |
| 103 | Carol Lee | carol.lee@initech.com | 2024-03-17 11:00:00 |
contact_id is retained.
Tips and gotchas
When using
_nekt_sync_at (the Nekt ingestion timestamp) for dedup ordering, keep in mind it reflects when the data was synced, not when it was updated at the source. Prefer a source-provided updated_at or modified_date column when available for more accurate deduplication.