When to use this
Raw data from APIs frequently arrives with everything as strings — dates stored as"2024-03-15", numbers as "49.99", and booleans as "true". Before you can run date math, numeric aggregations, or boolean filters, you need to cast these columns to their proper types.
Sample input
Anorders table in the Raw layer where every column is a string:
| order_id | order_date | total_amount | is_paid |
|---|---|---|---|
| 1001 | 2024-03-15 10:30:00 | 249.99 | true |
| 1002 | 2024-03-16 14:22:00 | 89.50 | false |
| 1003 | 2024-03-17 09:15:00 | 1200.00 | true |
order_date as a timestamp, total_amount as a decimal/float, and is_paid as a boolean.
Implementation
- Nekt Express / BigQuery
- Athena SQL
- Python (Nekt SDK)
BigQuery uses
CAST and PARSE_TIMESTAMP for flexible date parsing.Expected output
| order_id | order_date | total_amount | is_paid |
|---|---|---|---|
| 1001 | 2024-03-15 10:30:00.000 | 249.99 | true |
| 1002 | 2024-03-16 14:22:00.000 | 89.50 | false |
| 1003 | 2024-03-17 09:15:00.000 | 1200.00 | true |
SUM(total_amount), WHERE is_paid = true, and date math on order_date.
Tips and gotchas
When casting dates, always verify the timezone behavior. Athena and BigQuery may interpret timestamps differently depending on your session or dataset settings. Explicitly set the timezone when it matters:
- Athena:
AT TIME ZONE 'UTC' - BigQuery:
TIMESTAMP(order_date, 'UTC') - PySpark:
F.to_utc_timestamp(col, "America/Sao_Paulo")