Cast / convert data types

When to use this

Raw data from APIs frequently arrives with everything as strings — dates stored as "2024-03-15", numbers as "49.99", and booleans as "true". Before you can run date math, numeric aggregations, or boolean filters, you need to cast these columns to their proper types.

Sample input

An orders table in the Raw layer where every column is a string:

order_id	order_date	total_amount	is_paid
1001	2024-03-15 10:30:00	249.99	true
1002	2024-03-16 14:22:00	89.50	false
1003	2024-03-17 09:15:00	1200.00	true

We want order_date as a timestamp, total_amount as a decimal/float, and is_paid as a boolean.
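To see why the cast matters, here is a minimal plain-Python sketch (standard library only, not the Nekt SDK) that applies the same three conversions to the sample rows. The names `raw_rows` and `cast_row` are made up for illustration, and `Decimal` is just one reasonable choice for the amount column.

```python
from datetime import datetime
from decimal import Decimal

# Sample rows as they arrive in the Raw layer: every value is a string.
raw_rows = [
    {"order_id": "1001", "order_date": "2024-03-15 10:30:00", "total_amount": "249.99", "is_paid": "true"},
    {"order_id": "1002", "order_date": "2024-03-16 14:22:00", "total_amount": "89.50", "is_paid": "false"},
    {"order_id": "1003", "order_date": "2024-03-17 09:15:00", "total_amount": "1200.00", "is_paid": "true"},
]

# Strings compare character by character, so the "largest" amount is wrong.
largest_as_string = max(row["total_amount"] for row in raw_rows)


def cast_row(row):
    """Hypothetical helper: convert one all-string row to typed values."""
    return {
        "order_id": row["order_id"],
        "order_date": datetime.strptime(row["order_date"], "%Y-%m-%d %H:%M:%S"),
        "total_amount": Decimal(row["total_amount"]),
        "is_paid": row["is_paid"].strip().lower() == "true",
    }


typed_rows = [cast_row(row) for row in raw_rows]
largest_as_number = max(row["total_amount"] for row in typed_rows)

print(largest_as_string)  # 89.50
print(largest_as_number)  # 1200.00
```

The string comparison picks 89.50 because "8" sorts after "2" and "1", which is the kind of silent error typed columns prevent.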

Implementation

The same cast is shown for three engines: Nekt Express / BigQuery, Athena SQL, and Python (Nekt SDK).

In BigQuery, use CAST for straightforward conversions and PARSE_TIMESTAMP when a date string needs an explicit format.

SELECT
  order_id,
  CAST(order_date AS TIMESTAMP)       AS order_date,
  CAST(total_amount AS FLOAT64)       AS total_amount,
  CAST(is_paid AS BOOL)               AS is_paid
FROM `raw.orders`

For non-standard date formats, use PARSE_TIMESTAMP:

PARSE_TIMESTAMP('%d/%m/%Y', order_date) AS order_date

Athena supports CAST, plus helper functions such as date_parse for parsing timestamps from formatted strings.

SELECT
  order_id,
  CAST(order_date AS TIMESTAMP)    AS order_date,
  CAST(total_amount AS DOUBLE)     AS total_amount,
  CAST(is_paid AS BOOLEAN)         AS is_paid
FROM raw.orders

If the date string uses a non-standard format (e.g., 15/03/2024), use date_parse instead:

date_parse(order_date, '%d/%m/%Y') AS order_date

In PySpark, use cast() on each column or to_timestamp for date parsing.

import nekt
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, BooleanType

df = nekt.load_table(layer_name="Raw", table_name="orders")

casted_df = (
    df
    .withColumn("order_date", F.to_timestamp("order_date"))
    .withColumn("total_amount", F.col("total_amount").cast(DoubleType()))
    .withColumn("is_paid", F.col("is_paid").cast(BooleanType()))
)

nekt.save_table(
    df=casted_df,
    layer_name="Trusted",
    table_name="orders_typed"
)

For non-standard date formats, pass the format string to to_timestamp:

F.to_timestamp("order_date", "dd/MM/yyyy")
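Note that the pattern syntax differs between engines: BigQuery and Athena use %-style directives (%d/%m/%Y), while Spark uses letter patterns (dd/MM/yyyy). As a rough illustration outside any of these engines, Python's standard `datetime.strptime` accepts the same %d/%m/%Y directives, which makes it handy for checking a format string against a few sample values. The sample strings below are assumptions, not values from the orders table.

```python
from datetime import datetime

# Day-first string, as in the 15/03/2024 example.
parsed = datetime.strptime("15/03/2024", "%d/%m/%Y")
print(parsed.isoformat())  # 2024-03-15T00:00:00

# A wrong pattern fails loudly when the day is greater than 12 ...
try:
    datetime.strptime("15/03/2024", "%m/%d/%Y")
    wrong_pattern_failed = False
except ValueError:
    wrong_pattern_failed = True

# ... but an ambiguous value parses under both patterns with different meanings.
day_first = datetime.strptime("03/04/2024", "%d/%m/%Y")    # 3 April 2024
month_first = datetime.strptime("03/04/2024", "%m/%d/%Y")  # 4 March 2024
print(day_first.date(), month_first.date())  # 2024-04-03 2024-03-04
```

Because ambiguous values parse without error, it is worth checking the format against rows whose day is greater than 12 before trusting the result.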

Expected output

order_id	order_date	total_amount	is_paid
1001	2024-03-15 10:30:00.000	249.99	true
1002	2024-03-16 14:22:00.000	89.50	false
1003	2024-03-17 09:15:00.000	1200.00	true

The values look similar, but they are now proper typed columns — you can run SUM(total_amount), WHERE is_paid = true, and date math on order_date.
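As a small sketch of what typed values allow, the plain-Python snippet below mirrors those three operations on the expected output. The rows are hard-coded for illustration and the variable names are hypothetical; in practice the engine runs these as SQL or DataFrame operations.

```python
from datetime import datetime
from decimal import Decimal

# Typed rows matching the expected output (hard-coded for illustration).
orders = [
    {"order_id": "1001", "order_date": datetime(2024, 3, 15, 10, 30), "total_amount": Decimal("249.99"), "is_paid": True},
    {"order_id": "1002", "order_date": datetime(2024, 3, 16, 14, 22), "total_amount": Decimal("89.50"), "is_paid": False},
    {"order_id": "1003", "order_date": datetime(2024, 3, 17, 9, 15), "total_amount": Decimal("1200.00"), "is_paid": True},
]

# Equivalent of SUM(total_amount) ... WHERE is_paid = true
paid_total = sum(o["total_amount"] for o in orders if o["is_paid"])

# Date math: elapsed time between the first and last order
dates = [o["order_date"] for o in orders]
span = max(dates) - min(dates)

print(paid_total)  # 1449.99
print(span)        # 1 day, 22:45:00
```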

Tips and gotchas

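A CAST that fails (e.g., casting "N/A" to a number) fails the whole query in both BigQuery and Athena. PySpark's cast() returns NULL instead, unless ANSI mode (spark.sql.ansi.enabled) is turned on. To get NULL instead of an error, use SAFE_CAST in BigQuery or TRY_CAST in Athena. For example, in Athena: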

TRY_CAST(total_amount AS DOUBLE) AS total_amount
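The "NULL instead of an error" behavior can be sketched in plain Python as a helper that returns None when parsing fails. This is only an analogy: `try_cast_float` is a hypothetical name, the sample values are made up, and Python's `float()` does not follow exactly the same parsing rules as any SQL engine (for instance, it accepts "nan" and "inf").

```python
from typing import Optional


def try_cast_float(value: Optional[str]) -> Optional[float]:
    """Return a float, or None when the value is missing or not numeric."""
    if value is None:
        return None
    try:
        return float(value.strip())
    except ValueError:
        return None


raw_amounts = ["249.99", "N/A", "89.50", "", None]
typed_amounts = [try_cast_float(v) for v in raw_amounts]

# Values that were present in the raw data but could not be cast.
failed = [raw for raw, out in zip(raw_amounts, typed_amounts)
          if raw is not None and out is None]

print(typed_amounts)  # [249.99, None, 89.5, None, None]
print(failed)         # ['N/A', '']
```

Collecting the failed values separately is a cheap way to tell how many NULLs come from bad data rather than from values that were missing in the first place.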

When casting dates, always verify the timezone behavior. Athena and BigQuery may interpret timestamps differently depending on your session or dataset settings. Explicitly set the timezone when it matters:

Athena: AT TIME ZONE 'UTC'
BigQuery: TIMESTAMP(order_date, 'UTC')
PySpark: F.to_utc_timestamp(col, "America/Sao_Paulo")
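To make the timezone issue concrete, the sketch below uses Python's standard `zoneinfo` module (Python 3.9+, with system timezone data available) to read the same timestamp string two ways. The second reading mirrors what F.to_utc_timestamp does with "America/Sao_Paulo": the value is treated as local time in that zone and converted to UTC. The variable names are illustrative.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A timestamp string with no timezone information, as in the sample input.
naive = datetime.strptime("2024-03-15 10:30:00", "%Y-%m-%d %H:%M:%S")

# Reading 1: the value is already in UTC.
as_utc = naive.replace(tzinfo=timezone.utc)

# Reading 2: the value is Sao Paulo local time, converted to UTC.
sao_paulo_in_utc = naive.replace(tzinfo=ZoneInfo("America/Sao_Paulo")).astimezone(timezone.utc)

print(as_utc.isoformat())            # 2024-03-15T10:30:00+00:00
print(sao_paulo_in_utc.isoformat())  # 2024-03-15T13:30:00+00:00

# The two readings of the same string are three hours apart.
gap = sao_paulo_in_utc - as_utc
print(gap)  # 3:00:00
```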

