## When to use this
Raw data often contains rows that shouldn’t make it into your analytics — test records, incomplete entries, or data that violates business rules. Rather than letting bad data silently corrupt dashboards and reports, filter it out (or flag it) during the transformation step.
Common scenarios include:
- Removing test or sandbox records (e.g., emails ending in `@test.com`)
- Excluding rows with missing required fields
- Filtering out records outside a valid date range
- Flagging anomalous values for review
An orders table in the Raw layer with some invalid records:
| order_id | customer_email | amount | order_date | status |
|---|---|---|---|---|
| 1001 | alice@acme.com | 249.99 | 2024-03-15 10:30:00 | completed |
| 1002 | test@test.com | 0.01 | 2024-03-16 14:22:00 | completed |
| 1003 | bob@globex.com | -50.00 | 2024-03-17 09:15:00 | completed |
| 1004 | NULL | 1200.00 | 2024-03-18 11:00:00 | pending |
| 1005 | carol@initech.com | 89.50 | 2024-03-19 16:45:00 | completed |
We want to keep only rows where the email is not NULL, the email is not a test address, and the amount is positive.
## Implementation
### Nekt Express / BigQuery

Apply the filter rules above as `WHERE` clauses. BigQuery's `REGEXP_CONTAINS` can be useful for more complex patterns.

```sql
SELECT
  order_id,
  customer_email,
  amount,
  order_date,
  status
FROM `raw.orders`
WHERE customer_email IS NOT NULL
  AND customer_email NOT LIKE '%@test.com'
  AND amount > 0
```

For more complex email validation, use a regex:

```sql
WHERE REGEXP_CONTAINS(customer_email, r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
```
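The same pattern can be exercised outside the warehouse before you bake it into a transformation. A quick sketch using Python's `re` module (the pattern is copied verbatim from the example above; since it is anchored with `^` and `$`, `re.match` behaves like `REGEXP_CONTAINS` here):

```python
import re

# Same anchored pattern as in the BigQuery example
EMAIL_RE = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')

print(bool(EMAIL_RE.match("alice@acme.com")))  # well-formed address
print(bool(EMAIL_RE.match("not-an-email")))    # missing @ and domain
```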
### Athena SQL

Use `WHERE` clauses to filter out invalid rows based on your business rules.

```sql
SELECT
  order_id,
  customer_email,
  amount,
  order_date,
  status
FROM raw.orders
WHERE customer_email IS NOT NULL
  AND customer_email NOT LIKE '%@test.com'
  AND amount > 0
```

To keep a record of what was filtered out (useful for debugging), create a separate “rejected rows” table:

```sql
SELECT *, 'invalid_email' AS rejection_reason
FROM raw.orders
WHERE customer_email IS NULL OR customer_email LIKE '%@test.com'
UNION ALL
SELECT *, 'negative_amount' AS rejection_reason
FROM raw.orders
WHERE amount <= 0
```
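The rejected-rows idea can be mirrored in plain Python for a quick check of the logic. A minimal sketch over the invalid sample rows (field names come from the table above; unlike the SQL `UNION ALL`, this returns only the first matching reason per row):

```python
def rejection_reason(row):
    """Return why a row would be rejected, or None if it is valid."""
    email = row.get("customer_email")
    if email is None or email.endswith("@test.com"):
        return "invalid_email"
    if row["amount"] <= 0:
        return "negative_amount"
    return None

rows = [
    {"order_id": 1002, "customer_email": "test@test.com", "amount": 0.01},
    {"order_id": 1003, "customer_email": "bob@globex.com", "amount": -50.00},
    {"order_id": 1004, "customer_email": None, "amount": 1200.00},
]
rejected = [(r["order_id"], rejection_reason(r)) for r in rows]
print(rejected)
```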
### Python (Nekt SDK)

In PySpark, chain `filter` conditions to remove invalid rows.

```python
import nekt
from pyspark.sql import functions as F

df = nekt.load_table(layer_name="Raw", table_name="orders")

valid_df = (
    df
    .filter(F.col("customer_email").isNotNull())          # drop NULL emails
    .filter(~F.col("customer_email").like("%@test.com"))  # drop test addresses
    .filter(F.col("amount") > 0)                          # drop non-positive amounts
)

nekt.save_table(
    df=valid_df,
    layer_name="Trusted",
    table_name="orders_valid",
)
```

To flag instead of remove, add a validation column so downstream consumers can decide:

```python
flagged_df = df.withColumn(
    "is_valid",
    F.col("customer_email").isNotNull()
    & ~F.col("customer_email").like("%@test.com")
    & (F.col("amount") > 0),
)
```
## Expected output
| order_id | customer_email | amount | order_date | status |
|---|---|---|---|---|
| 1001 | alice@acme.com | 249.99 | 2024-03-15 10:30:00 | completed |
| 1005 | carol@initech.com | 89.50 | 2024-03-19 16:45:00 | completed |
Rows 1002 (test email), 1003 (negative amount), and 1004 (NULL email) are excluded.
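As a sanity check, the same three rules applied to the sample rows in plain Python (field names taken from the table above) keep exactly orders 1001 and 1005:

```python
orders = [
    {"order_id": 1001, "customer_email": "alice@acme.com", "amount": 249.99},
    {"order_id": 1002, "customer_email": "test@test.com", "amount": 0.01},
    {"order_id": 1003, "customer_email": "bob@globex.com", "amount": -50.00},
    {"order_id": 1004, "customer_email": None, "amount": 1200.00},
    {"order_id": 1005, "customer_email": "carol@initech.com", "amount": 89.50},
]

def is_valid(row):
    email = row["customer_email"]
    return email is not None and not email.endswith("@test.com") and row["amount"] > 0

kept = [r["order_id"] for r in orders if is_valid(r)]
print(kept)  # [1001, 1005]
```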
## Tips and gotchas
Filtering is destructive — once rows are removed, they won’t appear in downstream tables. If you’re unsure whether certain records should be excluded, consider flagging them with a boolean column (`is_valid`) instead of removing them. This lets analysts make the final call.
Keep your validation rules documented and versioned. When business rules change (e.g., a new test domain gets added), you’ll want to update the filter in one place rather than hunting for hardcoded values across multiple transformations.
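One lightweight way to keep the rules in one place is a small shared module that every transformation imports. A hypothetical sketch (the module name, the extra sandbox domain, and the helper names are illustrative, not part of the Nekt SDK):

```python
# validation_rules.py — single source of truth for filter values (hypothetical module)
TEST_EMAIL_DOMAINS = ("@test.com", "@sandbox.example.com")  # add new test domains here
MIN_AMOUNT = 0  # amounts must be strictly greater than this

def is_test_email(email):
    """True when the email belongs to a known test/sandbox domain."""
    return email is not None and email.endswith(TEST_EMAIL_DOMAINS)

def passes_rules(email, amount):
    """Apply all row-level validation rules in one place."""
    return email is not None and not is_test_email(email) and amount > MIN_AMOUNT
```

When a new test domain appears, only `TEST_EMAIL_DOMAINS` changes; every transformation that imports the module picks up the new rule.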