Skip to main content

When to use this

Raw data often contains rows that shouldn’t make it into your analytics — test records, incomplete entries, or data that violates business rules. Rather than letting bad data silently corrupt dashboards and reports, filter it out (or flag it) during the transformation step. Common scenarios include:
  • Removing test or sandbox records (e.g., emails ending in @test.com)
  • Excluding rows with missing required fields
  • Filtering out records outside a valid date range
  • Flagging anomalous values for review

Sample input

An orders table in the Raw layer with some invalid records:
order_idcustomer_emailamountorder_datestatus
1001alice@acme.com249.992024-03-15 10:30:00completed
1002test@test.com0.012024-03-16 14:22:00completed
1003bob@globex.com-50.002024-03-17 09:15:00completed
1004NULL1200.002024-03-18 11:00:00pending
1005carol@initech.com89.502024-03-19 16:45:00completed
We want to keep only rows where: the email is not a test address, the amount is positive, and the email is not NULL.

Implementation

Apply the same WHERE filters. BigQuery’s REGEXP_CONTAINS can be useful for more complex patterns.
SELECT
  order_id,
  customer_email,
  amount,
  order_date,
  status
FROM `raw.orders`
WHERE customer_email IS NOT NULL
  AND customer_email NOT LIKE '%@test.com'
  AND amount > 0
For more complex email validation, use regex:
WHERE REGEXP_CONTAINS(customer_email, r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')

Expected output

order_idcustomer_emailamountorder_datestatus
1001alice@acme.com249.992024-03-15 10:30:00completed
1005carol@initech.com89.502024-03-19 16:45:00completed
Rows 1002 (test email), 1003 (negative amount), and 1004 (NULL email) are excluded.

Tips and gotchas

Filtering is destructive — once rows are removed, they won’t appear in downstream tables. If you’re unsure whether certain records should be excluded, consider flagging them with a boolean column (is_valid) instead of removing them. This lets analysts make the final call.
Keep your validation rules documented and versioned. When business rules change (e.g., a new test domain gets added), you’ll want to update the filter in one place rather than hunting for hardcoded values across multiple transformations.