Skip to main content

When to use this

Raw data almost always has gaps — optional fields left blank, API responses with missing keys, or records that were partially synced. NULLs can break aggregations, cause unexpected JOIN behavior, and confuse BI tools. Handling them explicitly ensures your trusted layer is clean and predictable.

Sample input

A contacts table in the Raw layer with some missing values:
contact_idnameemailcompanyphone
1Alice Johnsonalice@acme.comAcme+1-555-0101
2Bob SmithNULLNULL+1-555-0102
3NULLcarol@initech.comInitechNULL
We want to replace NULLs with sensible defaults: "Unknown" for text fields and "N/A" for phone.

Implementation

BigQuery supports the same COALESCE syntax. You can also use IFNULL as a shorthand when there are only two arguments.
SELECT
  contact_id,
  COALESCE(name, 'Unknown')     AS name,
  IFNULL(email, 'Unknown')      AS email,
  COALESCE(company, 'Unknown')  AS company,
  IFNULL(phone, 'N/A')          AS phone
FROM `raw.contacts`
IFNULL(a, b) is equivalent to COALESCE(a, b) but only accepts two arguments. Use COALESCE when you have more than one fallback.

Expected output

contact_idnameemailcompanyphone
1Alice Johnsonalice@acme.comAcme+1-555-0101
2Bob SmithUnknownUnknown+1-555-0102
3Unknowncarol@initech.comInitechN/A

Tips and gotchas

Be careful with empty strings vs NULLs. Some sources return "" instead of NULL. COALESCE and fillna will not replace empty strings. To handle both:
  • SQL: COALESCE(NULLIF(name, ''), 'Unknown')NULLIF turns "" into NULL first.
  • PySpark: F.when(F.col("name").isNull() | (F.col("name") == ""), "Unknown").otherwise(F.col("name"))
Choose your default values carefully. Using "Unknown" or "N/A" is common, but in some cases 0 for numbers or a specific sentinel date (e.g., 1970-01-01) may be more appropriate. Document your conventions so downstream consumers know what to expect.