Parse JSON fields

When to use this

Many APIs store complex data as JSON strings inside a single column — for example, an address field that contains {"city": "São Paulo", "state": "SP", "zip": "01310-100"}, or a metadata column with nested properties. To query individual values, you need to parse the JSON and extract the fields you care about.

Sample input

A customers table in the Raw layer where the address column is a JSON string:

customer_id	name	address
1	Alice Johnson	{“city”: “São Paulo”, “state”: “SP”, “zip”: “01310-100”}
2	Bob Smith	{“city”: “New York”, “state”: “NY”, “zip”: “10001”}
3	Carol Lee	{“city”: “London”, “state”: “England”, “zip”: “EC1A 1BB”}

We want to extract city, state, and zip into their own columns.

Implementation

Nekt Express / BigQuery
Athena SQL
Python (Nekt SDK)

BigQuery uses JSON_EXTRACT_SCALAR (or the shorthand JSON_VALUE in Standard SQL) to pull values from a JSON string.

SELECT
  customer_id,
  name,
  JSON_VALUE(address, '$.city')  AS city,
  JSON_VALUE(address, '$.state') AS state,
  JSON_VALUE(address, '$.zip')   AS zip
FROM `raw.customers`

JSON_VALUE returns a STRING. To extract typed values, use JSON_EXTRACT with a CAST:

CAST(JSON_EXTRACT(data, '$.age') AS INT64) AS age

For arrays nested inside JSON, combine with JSON_EXTRACT_ARRAY and UNNEST.

Athena (Trino) provides json_extract_scalar to pull individual values from a JSON string.

SELECT
  customer_id,
  name,
  json_extract_scalar(address, '$.city')  AS city,
  json_extract_scalar(address, '$.state') AS state,
  json_extract_scalar(address, '$.zip')   AS zip
FROM raw.customers

For nested JSON, chain the path: json_extract_scalar(data, '$.address.city'). For non-string values (numbers, booleans), use json_extract and then CAST:

CAST(json_extract(data, '$.age') AS INTEGER) AS age

In PySpark, use from_json to parse the JSON string into a struct, then access individual fields with dot notation.

import nekt
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

df = nekt.load_table(layer_name="Raw", table_name="customers")

address_schema = StructType([
    StructField("city", StringType()),
    StructField("state", StringType()),
    StructField("zip", StringType())
])

parsed_df = (
    df
    .withColumn("address_parsed", F.from_json(F.col("address"), address_schema))
    .select(
        "customer_id",
        "name",
        F.col("address_parsed.city").alias("city"),
        F.col("address_parsed.state").alias("state"),
        F.col("address_parsed.zip").alias("zip")
    )
)

nekt.save_table(
    df=parsed_df,
    layer_name="Trusted",
    table_name="customers_with_address"
)

If you don’t know the JSON schema upfront, you can use get_json_object for quick extraction without defining a struct:

df.withColumn("city", F.get_json_object("address", "$.city"))

Expected output

customer_id	name	city	state	zip
1	Alice Johnson	São Paulo	SP	01310-100
2	Bob Smith	New York	NY	10001
3	Carol Lee	London	England	EC1A 1BB

Tips and gotchas

JSON extraction returns NULL when a key doesn’t exist in the object. If some rows have different JSON shapes (e.g., some include country and others don’t), the missing keys will silently become NULL — combine this with the Handle NULL values recipe as needed.

If your JSON contains arrays (e.g., "tags": ["hot", "enterprise"]), you’ll need to combine JSON parsing with Unnest arrays. First extract the array with JSON functions, then unnest the result.

​When to use this

​Sample input

​Implementation

​Expected output

​Tips and gotchas

When to use this

Sample input

Implementation

Expected output

Tips and gotchas