Skip to main content

When to use this

Raw data from APIs frequently arrives with everything as strings — dates stored as "2024-03-15", numbers as "49.99", and booleans as "true". Before you can run date math, numeric aggregations, or boolean filters, you need to cast these columns to their proper types.

Sample input

An orders table in the Raw layer where every column is a string:
order_idorder_datetotal_amountis_paid
10012024-03-15 10:30:00249.99true
10022024-03-16 14:22:0089.50false
10032024-03-17 09:15:001200.00true
We want order_date as a timestamp, total_amount as a decimal/float, and is_paid as a boolean.

Implementation

BigQuery uses CAST and PARSE_TIMESTAMP for flexible date parsing.
SELECT
  order_id,
  CAST(order_date AS TIMESTAMP)       AS order_date,
  CAST(total_amount AS FLOAT64)       AS total_amount,
  CAST(is_paid AS BOOL)               AS is_paid
FROM `raw.orders`
For non-standard date formats, use PARSE_TIMESTAMP:
PARSE_TIMESTAMP('%d/%m/%Y', order_date) AS order_date

Expected output

order_idorder_datetotal_amountis_paid
10012024-03-15 10:30:00.000249.99true
10022024-03-16 14:22:00.00089.50false
10032024-03-17 09:15:00.0001200.00true
The values look similar, but they are now proper typed columns — you can run SUM(total_amount), WHERE is_paid = true, and date math on order_date.

Tips and gotchas

A CAST that fails (e.g., casting "N/A" to a number) will produce NULL in BigQuery and PySpark, but will fail the query in Athena. Use TRY_CAST in Athena to get NULL instead of an error:
TRY_CAST(total_amount AS DOUBLE) AS total_amount
When casting dates, always verify the timezone behavior. Athena and BigQuery may interpret timestamps differently depending on your session or dataset settings. Explicitly set the timezone when it matters:
  • Athena: AT TIME ZONE 'UTC'
  • BigQuery: TIMESTAMP(order_date, 'UTC')
  • PySpark: F.to_utc_timestamp(col, "America/Sao_Paulo")