## When to use this
Raw data often contains rows that shouldn’t make it into your analytics — test records, incomplete entries, or data that violates business rules. Rather than letting bad data silently corrupt dashboards and reports, filter it out (or flag it) during the transformation step.
Common scenarios include:
- Removing test or sandbox records (e.g., emails ending in `@test.com`)
- Excluding rows with missing required fields
- Filtering out records outside a valid date range
- Flagging anomalous values for review
An orders table in the Raw layer with some invalid records:
| order_id | customer_email | amount | order_date | status |
|---|---|---|---|---|
| 1001 | alice@acme.com | 249.99 | 2024-03-15 10:30:00 | completed |
| 1002 | test@test.com | 0.01 | 2024-03-16 14:22:00 | completed |
| 1003 | bob@globex.com | -50.00 | 2024-03-17 09:15:00 | completed |
| 1004 | NULL | 1200.00 | 2024-03-18 11:00:00 | pending |
| 1005 | carol@initech.com | 89.50 | 2024-03-19 16:45:00 | completed |
We want to keep only rows where the email is not NULL, the email is not a test address, and the amount is positive.
## Implementation
### Nekt Express / BigQuery

Apply the filter rules above as `WHERE` clauses. BigQuery's `REGEXP_CONTAINS` can be useful for more complex patterns.

```sql
SELECT
  order_id,
  customer_email,
  amount,
  order_date,
  status
FROM `raw.orders`
WHERE customer_email IS NOT NULL
  AND customer_email NOT LIKE '%@test.com'
  AND amount > 0
```

For more complex email validation, use a regex:

```sql
WHERE REGEXP_CONTAINS(customer_email, r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
```
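The same pattern can be exercised outside the warehouse before you bake it into a transformation. A quick sketch using Python's `re` module (the pattern is copied verbatim from the example above; since it is anchored with `^` and `$`, `re.match` behaves like `REGEXP_CONTAINS` here):

```python
import re

# Same anchored pattern as in the BigQuery example
EMAIL_RE = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')

print(bool(EMAIL_RE.match("alice@acme.com")))  # well-formed address
print(bool(EMAIL_RE.match("not-an-email")))    # missing @ and domain
```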
### Athena SQL

Use `WHERE` clauses to filter out invalid rows based on your business rules.

```sql
SELECT
  order_id,
  customer_email,
  amount,
  order_date,
  status
FROM raw.orders
WHERE customer_email IS NOT NULL
  AND customer_email NOT LIKE '%@test.com'
  AND amount > 0
```

To keep a record of what was filtered out (useful for debugging), create a separate “rejected rows” table:

```sql
SELECT *, 'invalid_email' AS rejection_reason
FROM raw.orders
WHERE customer_email IS NULL OR customer_email LIKE '%@test.com'
UNION ALL
SELECT *, 'negative_amount' AS rejection_reason
FROM raw.orders
WHERE amount <= 0
```
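The rejected-rows idea can be mirrored in plain Python for a quick check of the logic. A minimal sketch over the invalid sample rows (field names come from the table above; unlike the SQL `UNION ALL`, this returns only the first matching reason per row):

```python
def rejection_reason(row):
    """Return why a row would be rejected, or None if it is valid."""
    email = row.get("customer_email")
    if email is None or email.endswith("@test.com"):
        return "invalid_email"
    if row["amount"] <= 0:
        return "negative_amount"
    return None

rows = [
    {"order_id": 1002, "customer_email": "test@test.com", "amount": 0.01},
    {"order_id": 1003, "customer_email": "bob@globex.com", "amount": -50.00},
    {"order_id": 1004, "customer_email": None, "amount": 1200.00},
]
rejected = [(r["order_id"], rejection_reason(r)) for r in rows]
print(rejected)
```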
### Python (Nekt SDK)

In PySpark, chain `filter` conditions to remove invalid rows.

```python
import nekt
from pyspark.sql import functions as F

df = nekt.load_table(layer_name="Raw", table_name="orders")

valid_df = (
    df
    .filter(F.col("customer_email").isNotNull())          # drop NULL emails
    .filter(~F.col("customer_email").like("%@test.com"))  # drop test addresses
    .filter(F.col("amount") > 0)                          # drop non-positive amounts
)

nekt.save_table(
    df=valid_df,
    layer_name="Trusted",
    table_name="orders_valid",
)
```

To flag instead of remove, add a validation column so downstream consumers can decide:

```python
flagged_df = df.withColumn(
    "is_valid",
    F.col("customer_email").isNotNull()
    & ~F.col("customer_email").like("%@test.com")
    & (F.col("amount") > 0),
)
```
## Expected output
| order_id | customer_email | amount | order_date | status |
|---|---|---|---|---|
| 1001 | alice@acme.com | 249.99 | 2024-03-15 10:30:00 | completed |
| 1005 | carol@initech.com | 89.50 | 2024-03-19 16:45:00 | completed |
Rows 1002 (test email), 1003 (negative amount), and 1004 (NULL email) are excluded.
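As a sanity check, the same three rules applied to the sample rows in plain Python (field names taken from the table above) keep exactly orders 1001 and 1005:

```python
orders = [
    {"order_id": 1001, "customer_email": "alice@acme.com", "amount": 249.99},
    {"order_id": 1002, "customer_email": "test@test.com", "amount": 0.01},
    {"order_id": 1003, "customer_email": "bob@globex.com", "amount": -50.00},
    {"order_id": 1004, "customer_email": None, "amount": 1200.00},
    {"order_id": 1005, "customer_email": "carol@initech.com", "amount": 89.50},
]

def is_valid(row):
    email = row["customer_email"]
    return email is not None and not email.endswith("@test.com") and row["amount"] > 0

kept = [r["order_id"] for r in orders if is_valid(r)]
print(kept)  # [1001, 1005]
```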
## Tips and gotchas
Filtering is destructive — once rows are removed, they won’t appear in downstream tables. If you’re unsure whether certain records should be excluded, consider flagging them with a boolean column (`is_valid`) instead of removing them. This lets analysts make the final call.
Keep your validation rules documented and versioned. When business rules change (e.g., a new test domain gets added), you’ll want to update the filter in one place rather than hunting for hardcoded values across multiple transformations.
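One lightweight way to keep the rules in one place is a small shared module that every transformation imports. A hypothetical sketch (the module name, the extra sandbox domain, and the helper names are illustrative, not part of the Nekt SDK):

```python
# validation_rules.py — single source of truth for filter values (hypothetical module)
TEST_EMAIL_DOMAINS = ("@test.com", "@sandbox.example.com")  # add new test domains here
MIN_AMOUNT = 0  # amounts must be strictly greater than this

def is_test_email(email):
    """True when the email belongs to a known test/sandbox domain."""
    return email is not None and email.endswith(TEST_EMAIL_DOMAINS)

def passes_rules(email, amount):
    """Apply all row-level validation rules in one place."""
    return email is not None and not is_test_email(email) and amount > MIN_AMOUNT
```

When a new test domain appears, only `TEST_EMAIL_DOMAINS` changes; every transformation that imports the module picks up the new rule.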