When to use this
API sources often return data with multiple levels of nesting — a user object that contains an address object which itself contains a coordinates object. While Parse JSON fields works for a single level, deeply nested structures require a systematic approach to flatten everything into a single, query-friendly row.
Suppose you have an events table in the Raw layer where each row holds a nested JSON payload:
| event_id | event_type | payload |
|---|---|---|
| 1 | purchase | {"user": {"name": "Alice", "address": {"city": "São Paulo", "country": "BR"}}, "amount": 250.00} |
| 2 | signup | {"user": {"name": "Bob", "address": {"city": "New York", "country": "US"}}, "amount": null} |
| 3 | purchase | {"user": {"name": "Carol", "address": {"city": "London", "country": "GB"}}, "amount": 180.50} |
We want to flatten all nested fields into top-level columns.
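Before reaching for engine-specific SQL, it helps to see the idea in isolation. The sketch below is plain Python (illustrative only, not part of the Nekt SDK): a small recursive helper that walks a nested dict and joins the key path with underscores, which is exactly the column naming the SQL examples below produce by hand.

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into a single-level dict,
    joining keys with underscores (user -> address -> city
    becomes user_address_city)."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

payload = '{"user": {"name": "Alice", "address": {"city": "São Paulo", "country": "BR"}}, "amount": 250.0}'
print(flatten(json.loads(payload)))
# {'user_name': 'Alice', 'user_address_city': 'São Paulo', 'user_address_country': 'BR', 'amount': 250.0}
```

The SQL approaches below do the same thing declaratively: each output column names one leaf path through the nested structure.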
Implementation
Nekt Express / BigQuery
Athena SQL
Python (Nekt SDK)
Use dot-notation paths with `JSON_VALUE` to traverse the nested structure.

```sql
SELECT
  event_id,
  event_type,
  JSON_VALUE(payload, '$.user.name') AS user_name,
  JSON_VALUE(payload, '$.user.address.city') AS user_city,
  JSON_VALUE(payload, '$.user.address.country') AS user_country,
  CAST(JSON_VALUE(payload, '$.amount') AS FLOAT64) AS amount
FROM `raw.events`
```
If `payload` is stored as a native JSON type (not a string), you can access fields directly: `payload.user.name`. For string columns, `JSON_VALUE` is required.
Use `json_extract_scalar` with the full JSON path to reach each nested value.

```sql
SELECT
  event_id,
  event_type,
  json_extract_scalar(payload, '$.user.name') AS user_name,
  json_extract_scalar(payload, '$.user.address.city') AS user_city,
  json_extract_scalar(payload, '$.user.address.country') AS user_country,
  CAST(json_extract(payload, '$.amount') AS DOUBLE) AS amount
FROM raw.events
```
For very deep nesting, extract intermediate objects first and then parse them:

```sql
WITH parsed AS (
  SELECT
    event_id,
    json_extract(payload, '$.user') AS user_obj
  FROM raw.events
)
SELECT
  event_id,
  json_extract_scalar(user_obj, '$.name') AS user_name,
  json_extract_scalar(user_obj, '$.address.city') AS user_city
FROM parsed
```
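The two-stage pattern is easier to see outside SQL. This plain-Python sketch (illustrative, not Athena-specific) mirrors the CTE: first pull out the intermediate object, then read scalar leaves from that smaller piece.

```python
import json

payload = '{"user": {"name": "Alice", "address": {"city": "São Paulo", "country": "BR"}}, "amount": 250.0}'

# Stage 1: extract the intermediate object (the CTE's user_obj).
user_obj = json.loads(payload)["user"]

# Stage 2: parse scalar leaves from the smaller object.
row = {
    "user_name": user_obj["name"],
    "user_city": user_obj["address"]["city"],
}
print(row)
# {'user_name': 'Alice', 'user_city': 'São Paulo'}
```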
In PySpark, define the full nested schema with `StructType`, parse the string column with `from_json`, and then use dot notation to select leaf fields.

```python
import nekt
from pyspark.sql import functions as F
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType
)

df = nekt.load_table(layer_name="Raw", table_name="events")

payload_schema = StructType([
    StructField("user", StructType([
        StructField("name", StringType()),
        StructField("address", StructType([
            StructField("city", StringType()),
            StructField("country", StringType()),
        ])),
    ])),
    StructField("amount", DoubleType()),
])

flat_df = (
    df
    .withColumn("parsed", F.from_json(F.col("payload"), payload_schema))
    .select(
        "event_id",
        "event_type",
        F.col("parsed.user.name").alias("user_name"),
        F.col("parsed.user.address.city").alias("user_city"),
        F.col("parsed.user.address.country").alias("user_country"),
        F.col("parsed.amount").alias("amount"),
    )
)

nekt.save_table(
    df=flat_df,
    layer_name="Trusted",
    table_name="events_flat",
)
```
If the payload column is already stored as a native struct (not a JSON string), skip `from_json` and access nested fields directly with dot notation:

```python
df.select("event_id", F.col("payload.user.name").alias("user_name"))
```
Expected output
| event_id | event_type | user_name | user_city | user_country | amount |
|---|---|---|---|---|---|
| 1 | purchase | Alice | São Paulo | BR | 250.00 |
| 2 | signup | Bob | New York | US | NULL |
| 3 | purchase | Carol | London | GB | 180.50 |
Tips and gotchas
When flattening, naming collisions can happen. If both the root and a nested object have a field called `name`, alias them explicitly (e.g., `user_name` vs `event_name`) to avoid ambiguity.
For very wide or deeply nested JSON (10+ fields, 3+ levels), consider flattening incrementally — create an intermediate table with the first level flattened, then flatten further in a second transformation. This makes debugging and maintenance much easier.
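The incremental approach above can be sketched in plain Python (illustrative only): a helper that expands just one level of nesting per pass, leaving deeper objects intact for the next transformation, just as the intermediate table would.

```python
def flatten_one_level(obj):
    """Expand only the top level of nesting; deeper dicts stay
    intact, to be flattened in a later transformation step."""
    flat = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat

nested = {"user": {"name": "Bob", "address": {"city": "New York"}}}
step1 = flatten_one_level(nested)   # 'user_address' is still a dict here
step2 = flatten_one_level(step1)
print(step2)
# {'user_name': 'Bob', 'user_address_city': 'New York'}
```

Each pass corresponds to one intermediate table, so a bug in the output can be traced to the specific level where it was introduced.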