Deduplicate rows

When to use this

Duplicates creep in from many sources: an API that returns overlapping pages, a webhook that fires twice, or a full-refresh sync that re-ingests historical records. Before the data reaches the Trusted layer, decide which kind of duplicate you are removing: exact duplicates (fully identical rows), or near-duplicates (rows that share a business key but differ in timestamp or values, where you want to keep only the latest).

Sample input

A contacts table in the Raw layer with duplicate records from overlapping syncs:

contact_id	name	email	updated_at
101	Alice Johnson	alice@acme.com	2024-03-15 10:00:00
101	Alice Johnson	alice@acme.com	2024-03-16 14:30:00
102	Bob Smith	bob@globex.com	2024-03-15 09:00:00
103	Carol Lee	carol@initech.com	2024-03-14 08:00:00
103	Carol Lee	carol.lee@initech.com	2024-03-17 11:00:00

Contact 101 appears twice with the same data but different timestamps. Contact 103 appears twice with an updated email. In both cases, we want to keep only the most recent row per contact_id.

Implementation

Nekt Express / BigQuery
Athena SQL
Python (Nekt SDK)

In Nekt Express / BigQuery, ROW_NUMBER() ranks the rows within each contact_id from newest to oldest, and QUALIFY keeps only the row ranked 1.

SELECT
  contact_id,
  name,
  email,
  updated_at
FROM `raw.contacts`
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY contact_id
  ORDER BY updated_at DESC
) = 1

QUALIFY filters on the result of a window function directly, so no CTE or subquery is needed. It is not part of standard SQL, but BigQuery supports it, and it makes dedup queries noticeably shorter.

In Athena SQL, use ROW_NUMBER() inside a CTE to rank the rows within each contact_id, then keep only the latest one.

WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY contact_id
      ORDER BY updated_at DESC
    ) AS row_num
  FROM raw.contacts
)
SELECT
  contact_id,
  name,
  email,
  updated_at
FROM ranked
WHERE row_num = 1

For exact duplicates (fully identical rows), a simple SELECT DISTINCT is sufficient:

SELECT DISTINCT contact_id, name, email, updated_at
FROM raw.contacts
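To see how the two queries behave on the sample input, they can be run locally. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Athena, and the table name raw_contacts is an assumption made for this example. It needs a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# Rows copied from the "Sample input" table.
ROWS = [
    (101, "Alice Johnson", "alice@acme.com", "2024-03-15 10:00:00"),
    (101, "Alice Johnson", "alice@acme.com", "2024-03-16 14:30:00"),
    (102, "Bob Smith", "bob@globex.com", "2024-03-15 09:00:00"),
    (103, "Carol Lee", "carol@initech.com", "2024-03-14 08:00:00"),
    (103, "Carol Lee", "carol.lee@initech.com", "2024-03-17 11:00:00"),
]

# Assumption: an in-memory SQLite table stands in for raw.contacts.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_contacts "
    "(contact_id INTEGER, name TEXT, email TEXT, updated_at TEXT)"
)
conn.executemany("INSERT INTO raw_contacts VALUES (?, ?, ?, ?)", ROWS)

# Near-duplicates: rank rows per contact_id, newest first, keep rank 1.
LATEST_PER_KEY = """
WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY contact_id
      ORDER BY updated_at DESC
    ) AS row_num
  FROM raw_contacts
)
SELECT contact_id, name, email, updated_at
FROM ranked
WHERE row_num = 1
ORDER BY contact_id
"""

# Exact duplicates only: rows must match in every column to collapse.
EXACT_ONLY = """
SELECT DISTINCT contact_id, name, email, updated_at
FROM raw_contacts
"""

latest_rows = conn.execute(LATEST_PER_KEY).fetchall()
distinct_rows = conn.execute(EXACT_ONLY).fetchall()

print(len(latest_rows))    # one row per contact_id
print(len(distinct_rows))  # no sample rows are fully identical, so none are removed
```

On this sample, SELECT DISTINCT leaves all five rows in place because every duplicate differs at least in updated_at, which is why the ROW_NUMBER() query is the one that produces the expected output.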

In PySpark, use Window functions with row_number to rank and filter duplicates.

import nekt
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = nekt.load_table(layer_name="Raw", table_name="contacts")

window = Window.partitionBy("contact_id").orderBy(F.col("updated_at").desc())

deduped_df = (
    df
    .withColumn("row_num", F.row_number().over(window))
    .filter(F.col("row_num") == 1)
    .drop("row_num")
)

nekt.save_table(
    df=deduped_df,
    layer_name="Trusted",
    table_name="contacts_deduped"
)

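If any single row per contact_id will do, PySpark’s dropDuplicates with a key subset is simpler (called with no arguments, it removes only fully identical rows):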

deduped_df = df.dropDuplicates(["contact_id"])

However, this doesn’t guarantee which row is kept. Use the Window approach when you need control over which row survives (e.g., the most recent).

Expected output

contact_id	name	email	updated_at
101	Alice Johnson	alice@acme.com	2024-03-16 14:30:00
102	Bob Smith	bob@globex.com	2024-03-15 09:00:00
103	Carol Lee	carol.lee@initech.com	2024-03-17 11:00:00

Only the most recent row per contact_id is retained.

Tips and gotchas

Make sure the column you use for ordering (updated_at, _nekt_sync_at, etc.) is reliably populated. If some rows have NULL timestamps, they may sort unexpectedly. Add NULLS LAST in SQL or use F.col("updated_at").desc_nulls_last() in PySpark to push NULLs to the bottom.
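The effect of ordering with NULLs last can be sketched in plain Python, without Spark or a warehouse. The helper name latest_per_key and the rows for contacts 104 and 105 are hypothetical, made up for this illustration; None plays the role of a NULL timestamp.

```python
from datetime import datetime


def latest_per_key(rows, key="contact_id", order_by="updated_at"):
    """Keep one row per key: the row with the greatest non-NULL timestamp.

    A row with a None timestamp survives only if its key has no dated row,
    which mirrors ORDER BY ... DESC NULLS LAST followed by row_num = 1.
    """
    best = {}
    for row in rows:
        ts = row[order_by]
        # (False, ...) compares below (True, ...), so None always ranks last.
        rank = (ts is not None, ts if ts is not None else datetime.min)
        current = best.get(row[key])
        if current is None or rank > current[0]:
            best[row[key]] = (rank, row)
    return [row for _, row in best.values()]


# Hypothetical rows, not part of the sample input.
rows = [
    {"contact_id": 104, "email": "dan@example.com", "updated_at": None},
    {"contact_id": 104, "email": "dan.new@example.com",
     "updated_at": datetime(2024, 3, 18, 9, 0)},
    {"contact_id": 105, "email": "eve@example.com", "updated_at": None},
]

deduped = latest_per_key(rows)
for row in deduped:
    print(row["contact_id"], row["email"])
```

Contact 104 keeps the dated row even though the NULL row arrived first, while contact 105 keeps its only row despite the missing timestamp.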

When using _nekt_sync_at (the Nekt ingestion timestamp) for dedup ordering, keep in mind it reflects when the data was synced, not when it was updated at the source. Prefer a source-provided updated_at or modified_date column when available for more accurate deduplication.
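A small plain-Python sketch of how the two ordering columns can disagree. The rows and their _nekt_sync_at values are hypothetical: they assume the older version of contact 103 happened to be synced after the newer one, for example because of out-of-order delivery. The helper keep_latest is likewise an illustrative name.

```python
# Hypothetical rows: the stale version of contact 103 was synced last.
rows = [
    {"contact_id": 103, "email": "carol.lee@initech.com",
     "updated_at": "2024-03-17 11:00:00",
     "_nekt_sync_at": "2024-03-17 12:00:00"},
    {"contact_id": 103, "email": "carol@initech.com",
     "updated_at": "2024-03-14 08:00:00",
     "_nekt_sync_at": "2024-03-18 02:00:00"},
]


def keep_latest(rows, order_by):
    """Return the row with the greatest value in the order_by column.

    Timestamps here are ISO-formatted strings, so string comparison
    matches chronological order.
    """
    return max(rows, key=lambda row: row[order_by])


by_source = keep_latest(rows, "updated_at")
by_sync = keep_latest(rows, "_nekt_sync_at")

print(by_source["email"])  # carol.lee@initech.com
print(by_sync["email"])    # carol@initech.com
```

Ordering by the sync timestamp keeps the stale email in this scenario, while the source-provided updated_at keeps the current one.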

