Handle NULL values

When to use this

Raw data almost always has gaps: optional fields left blank, API responses with missing keys, or records that were only partially synced. NULLs cause subtle problems downstream: aggregates such as AVG and COUNT(column) silently skip them, a NULL join key never matches another row (not even another NULL), and BI tools can display or filter them in confusing ways. Handling them explicitly keeps your Trusted layer clean and predictable.

Sample input

A contacts table in the Raw layer with some missing values:

contact_id	name	email	company	phone
1	Alice Johnson	alice@acme.com	Acme	+1-555-0101
2	Bob Smith	NULL	NULL	+1-555-0102
3	NULL	carol@initech.com	Initech	NULL

We want to replace NULLs with sensible defaults: "Unknown" for text fields and "N/A" for phone.

Implementation

The examples below cover, in order: Nekt Express / BigQuery, Athena SQL, and Python (Nekt SDK).

In Nekt Express / BigQuery, COALESCE replaces a NULL with a fallback value. You can also use IFNULL as a shorthand when there are only two arguments.

SELECT
  contact_id,
  COALESCE(name, 'Unknown')     AS name,
  IFNULL(email, 'Unknown')      AS email,
  COALESCE(company, 'Unknown')  AS company,
  IFNULL(phone, 'N/A')          AS phone
FROM `raw.contacts`

IFNULL(a, b) is equivalent to COALESCE(a, b) but only accepts two arguments. Use COALESCE when you have more than one fallback.

In Athena SQL, use COALESCE to return the first non-NULL value from a list of arguments.

SELECT
  contact_id,
  COALESCE(name, 'Unknown')     AS name,
  COALESCE(email, 'Unknown')    AS email,
  COALESCE(company, 'Unknown')  AS company,
  COALESCE(phone, 'N/A')        AS phone
FROM raw.contacts

You can chain multiple fallbacks: COALESCE(preferred_email, work_email, personal_email, 'Unknown') returns the first non-NULL value from left to right.
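
To make the left-to-right behavior concrete, here is a minimal sketch that runs the chained expression against an in-memory SQLite database. SQLite is only a stand-in for the warehouse (an assumption made so the example runs anywhere), and the three-row table is hypothetical; the column names are the ones used in the sentence above.

```python
import sqlite3

# SQLite stands in for BigQuery/Athena here (assumption); the table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (contact_id INTEGER, preferred_email TEXT, "
    "work_email TEXT, personal_email TEXT)"
)
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?, ?)",
    [
        (1, "a@pref.example", "a@work.example", None),  # first argument is non-NULL
        (2, None, None, "b@home.example"),              # falls through to the third
        (3, None, None, None),                          # every column is NULL
    ],
)

chained = conn.execute(
    """
    SELECT contact_id,
           COALESCE(preferred_email, work_email, personal_email, 'Unknown') AS email
    FROM contacts
    ORDER BY contact_id
    """
).fetchall()
conn.close()

print(chained)
# [(1, 'a@pref.example'), (2, 'b@home.example'), (3, 'Unknown')]
```

The literal 'Unknown' is placed last so that it is only used when every column before it is NULL.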

In PySpark, use fillna to replace NULLs across multiple columns at once, or coalesce for column-level logic.

import nekt

df = nekt.load_table(layer_name="Raw", table_name="contacts")

clean_df = df.fillna({
    "name": "Unknown",
    "email": "Unknown",
    "company": "Unknown",
    "phone": "N/A"
})

nekt.save_table(
    df=clean_df,
    layer_name="Trusted",
    table_name="contacts_clean"
)

For more complex logic (e.g., falling back to another column), use F.coalesce:

from pyspark.sql import functions as F

df = df.withColumn(
    "email",
    F.coalesce(F.col("email"), F.col("secondary_email"), F.lit("Unknown"))
)

Expected output

contact_id	name	email	company	phone
1	Alice Johnson	alice@acme.com	Acme	+1-555-0101
2	Bob Smith	Unknown	Unknown	+1-555-0102
3	Unknown	carol@initech.com	Initech	N/A
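
If you want to check the transformation without a warehouse connection, the sketch below loads the three sample rows into an in-memory SQLite database and runs the COALESCE query against them. SQLite is a stand-in engine (an assumption), and the table is named contacts rather than raw.contacts because the sketch does not create a raw schema.

```python
import sqlite3

# In-memory SQLite database used as a stand-in engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (contact_id INTEGER, name TEXT, email TEXT, "
    "company TEXT, phone TEXT)"
)
# The three rows from the sample input; Python None is stored as SQL NULL.
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Alice Johnson", "alice@acme.com", "Acme", "+1-555-0101"),
        (2, "Bob Smith", None, None, "+1-555-0102"),
        (3, None, "carol@initech.com", "Initech", None),
    ],
)

cleaned = conn.execute(
    """
    SELECT contact_id,
           COALESCE(name, 'Unknown')    AS name,
           COALESCE(email, 'Unknown')   AS email,
           COALESCE(company, 'Unknown') AS company,
           COALESCE(phone, 'N/A')       AS phone
    FROM contacts
    ORDER BY contact_id
    """
).fetchall()
conn.close()

# The rows listed in the expected output table.
expected = [
    (1, "Alice Johnson", "alice@acme.com", "Acme", "+1-555-0101"),
    (2, "Bob Smith", "Unknown", "Unknown", "+1-555-0102"),
    (3, "Unknown", "carol@initech.com", "Initech", "N/A"),
]
print(cleaned == expected)  # True
```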

Tips and gotchas

Be careful with empty strings vs NULLs. Some sources return "" instead of NULL. COALESCE and fillna will not replace empty strings. To handle both:

SQL: COALESCE(NULLIF(name, ''), 'Unknown') — NULLIF turns "" into NULL first.
PySpark: F.when(F.col("name").isNull() | (F.col("name") == ""), "Unknown").otherwise(F.col("name"))
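
The difference between the two SQL forms is easy to see side by side. The following sketch again uses an in-memory SQLite database as a stand-in engine (an assumption) with a hypothetical three-row table: one NULL name, one empty-string name, and one real value.

```python
import sqlite3

# SQLite as a stand-in engine (assumption); the table and its label column are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (label TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [("null", None), ("empty", ""), ("value", "Carol")],
)

rows = conn.execute(
    """
    SELECT label,
           COALESCE(name, 'Unknown')             AS coalesce_only,
           COALESCE(NULLIF(name, ''), 'Unknown') AS with_nullif
    FROM contacts
    ORDER BY label
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
# ('empty', '', 'Unknown')        <- plain COALESCE keeps the empty string
# ('null', 'Unknown', 'Unknown')
# ('value', 'Carol', 'Carol')
```

Only the NULLIF variant replaces the empty string; real values pass through both expressions unchanged.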

Choose your default values carefully. Using "Unknown" or "N/A" is common, but in some cases 0 for numbers or a specific sentinel date (e.g., 1970-01-01) may be more appropriate. Document your conventions so downstream consumers know what to expect.
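
One reason numeric defaults deserve care is that they change aggregate results. The sketch below is an illustration only: it uses an in-memory SQLite database as a stand-in engine and a hypothetical deals table with an amount column, neither of which is part of the contacts example.

```python
import sqlite3

# SQLite as a stand-in engine (assumption); the deals table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deals (deal_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO deals VALUES (?, ?)",
    [(1, 100.0), (2, None), (3, 200.0)],
)

# AVG skips NULLs, so filling them with 0 adds a row to the denominator.
avg_skipping_nulls, avg_with_zero_default = conn.execute(
    "SELECT AVG(amount), AVG(COALESCE(amount, 0)) FROM deals"
).fetchone()
conn.close()

print(avg_skipping_nulls)     # 150.0
print(avg_with_zero_default)  # 100.0
```

Neither number is wrong in itself. They answer different questions, which is why the chosen default should be documented next to the column.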
