When to use this
API sources often return data with multiple levels of nesting — a user object that contains an address object which itself contains a coordinates object. While Parse JSON fields works for a single level, deeply nested structures require a systematic approach to flatten everything into a single, query-friendly row.
Suppose you have an events table in the Raw layer where each row holds a nested JSON payload:
| event_id | event_type | payload |
|---|---|---|
| 1 | purchase | {"user": {"name": "Alice", "address": {"city": "São Paulo", "country": "BR"}}, "amount": 250.00} |
| 2 | signup | {"user": {"name": "Bob", "address": {"city": "New York", "country": "US"}}, "amount": null} |
| 3 | purchase | {"user": {"name": "Carol", "address": {"city": "London", "country": "GB"}}, "amount": 180.50} |
We want to flatten all nested fields into top-level columns.
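Before reaching for engine-specific SQL, it helps to see the idea in isolation. The sketch below is plain Python (illustrative only, not part of the Nekt SDK): a small recursive helper that walks a nested dict and joins the key path with underscores, which is exactly the column naming the SQL examples below produce by hand.

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into a single-level dict,
    joining keys with underscores (user -> address -> city
    becomes user_address_city)."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

payload = '{"user": {"name": "Alice", "address": {"city": "São Paulo", "country": "BR"}}, "amount": 250.0}'
print(flatten(json.loads(payload)))
# {'user_name': 'Alice', 'user_address_city': 'São Paulo', 'user_address_country': 'BR', 'amount': 250.0}
```

The SQL approaches below do the same thing declaratively: each output column names one leaf path through the nested structure.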
Implementation
Nekt Express / BigQuery
Athena SQL
Python (Nekt SDK)
Use dot-notation paths with `JSON_VALUE` to traverse the nested structure.

```sql
SELECT
  event_id,
  event_type,
  JSON_VALUE(payload, '$.user.name') AS user_name,
  JSON_VALUE(payload, '$.user.address.city') AS user_city,
  JSON_VALUE(payload, '$.user.address.country') AS user_country,
  CAST(JSON_VALUE(payload, '$.amount') AS FLOAT64) AS amount
FROM `raw.events`
```
If `payload` is stored as a native JSON type (not a string), you can access fields directly: `payload.user.name`. For string columns, `JSON_VALUE` is required.
Use `json_extract_scalar` with the full JSON path to reach each nested value.

```sql
SELECT
  event_id,
  event_type,
  json_extract_scalar(payload, '$.user.name') AS user_name,
  json_extract_scalar(payload, '$.user.address.city') AS user_city,
  json_extract_scalar(payload, '$.user.address.country') AS user_country,
  CAST(json_extract(payload, '$.amount') AS DOUBLE) AS amount
FROM raw.events
```
For very deep nesting, extract intermediate objects first and then parse them:

```sql
WITH parsed AS (
  SELECT
    event_id,
    json_extract(payload, '$.user') AS user_obj
  FROM raw.events
)
SELECT
  event_id,
  json_extract_scalar(user_obj, '$.name') AS user_name,
  json_extract_scalar(user_obj, '$.address.city') AS user_city
FROM parsed
```
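The two-stage pattern is easier to see outside SQL. This plain-Python sketch (illustrative, not Athena-specific) mirrors the CTE: first pull out the intermediate object, then read scalar leaves from that smaller piece.

```python
import json

payload = '{"user": {"name": "Alice", "address": {"city": "São Paulo", "country": "BR"}}, "amount": 250.0}'

# Stage 1: extract the intermediate object (the CTE's user_obj).
user_obj = json.loads(payload)["user"]

# Stage 2: parse scalar leaves from the smaller object.
row = {
    "user_name": user_obj["name"],
    "user_city": user_obj["address"]["city"],
}
print(row)
# {'user_name': 'Alice', 'user_city': 'São Paulo'}
```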
In PySpark, define the full nested schema with `StructType`, parse the string column with `from_json`, and then use dot notation to select leaf fields.

```python
import nekt
from pyspark.sql import functions as F
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType
)

df = nekt.load_table(layer_name="Raw", table_name="events")

payload_schema = StructType([
    StructField("user", StructType([
        StructField("name", StringType()),
        StructField("address", StructType([
            StructField("city", StringType()),
            StructField("country", StringType()),
        ])),
    ])),
    StructField("amount", DoubleType()),
])

flat_df = (
    df
    .withColumn("parsed", F.from_json(F.col("payload"), payload_schema))
    .select(
        "event_id",
        "event_type",
        F.col("parsed.user.name").alias("user_name"),
        F.col("parsed.user.address.city").alias("user_city"),
        F.col("parsed.user.address.country").alias("user_country"),
        F.col("parsed.amount").alias("amount"),
    )
)

nekt.save_table(
    df=flat_df,
    layer_name="Trusted",
    table_name="events_flat",
)
```
If the payload column is already stored as a native struct (not a JSON string), skip `from_json` and access nested fields directly with dot notation:

```python
df.select("event_id", F.col("payload.user.name").alias("user_name"))
```
Expected output
| event_id | event_type | user_name | user_city | user_country | amount |
|---|---|---|---|---|---|
| 1 | purchase | Alice | São Paulo | BR | 250.00 |
| 2 | signup | Bob | New York | US | NULL |
| 3 | purchase | Carol | London | GB | 180.50 |
Tips and gotchas
When flattening, naming collisions can happen. If both the root and a nested object have a field called `name`, alias them explicitly (e.g., `user_name` vs `event_name`) to avoid ambiguity.
For very wide or deeply nested JSON (10+ fields, 3+ levels), consider flattening incrementally — create an intermediate table with the first level flattened, then flatten further in a second transformation. This makes debugging and maintenance much easier.
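The incremental approach above can be sketched in plain Python (illustrative only): a helper that expands just one level of nesting per pass, leaving deeper objects intact for the next transformation, just as the intermediate table would.

```python
def flatten_one_level(obj):
    """Expand only the top level of nesting; deeper dicts stay
    intact, to be flattened in a later transformation step."""
    flat = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat

nested = {"user": {"name": "Bob", "address": {"city": "New York"}}}
step1 = flatten_one_level(nested)   # 'user_address' is still a dict here
step2 = flatten_one_level(step1)
print(step2)
# {'user_name': 'Bob', 'user_address_city': 'New York'}
```

Each pass corresponds to one intermediate table, so a bug in the output can be traced to the specific level where it was introduced.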