
When to use this

Date operations are at the heart of most analytics — calculating how many days since a customer’s last purchase, extracting the month from a timestamp for grouping, or formatting dates for a report. Each SQL engine handles dates slightly differently, so this recipe covers the most common operations side by side.

Sample input

An orders table in the Raw layer:
order_id  customer_id  order_date           shipped_date
1001      201          2024-01-15 10:30:00  2024-01-18 14:00:00
1002      202          2024-02-20 09:00:00  2024-02-22 11:30:00
1003      201          2024-03-05 16:45:00  NULL
We want to:
  1. Extract the year and month from order_date
  2. Calculate the days between order and shipment
  3. Determine days since the order (relative to today)

Implementation

BigQuery uses EXTRACT, DATE_DIFF, and CURRENT_DATE().
SELECT
  order_id,
  customer_id,
  order_date,
  shipped_date,
  EXTRACT(YEAR FROM order_date)                               AS order_year,
  EXTRACT(MONTH FROM order_date)                              AS order_month,
  DATE_DIFF(CAST(shipped_date AS DATE), CAST(order_date AS DATE), DAY)  AS days_to_ship,
  DATE_DIFF(CURRENT_DATE(), CAST(order_date AS DATE), DAY)    AS days_since_order
FROM `raw.orders`
Note the argument order difference: BigQuery’s DATE_DIFF takes (end, start, part) while Athena’s date_diff takes (part, start, end). This is a common source of bugs when porting queries.
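For comparison, here is a sketch of the equivalent Athena (Presto/Trino) query, assuming the same orders table is available as raw.orders. Note date_diff takes the unit first and current_date has no parentheses:

```sql
SELECT
  order_id,
  customer_id,
  order_date,
  shipped_date,
  EXTRACT(YEAR FROM order_date)                                          AS order_year,
  EXTRACT(MONTH FROM order_date)                                         AS order_month,
  -- Athena: unit first, then start, then end
  date_diff('day', CAST(order_date AS DATE), CAST(shipped_date AS DATE)) AS days_to_ship,
  date_diff('day', CAST(order_date AS DATE), current_date)               AS days_since_order
FROM raw.orders
```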

Expected output

Assuming today is 2024-12-01:
order_id  customer_id  order_date           shipped_date         order_year  order_month  days_to_ship  days_since_order
1001      201          2024-01-15 10:30:00  2024-01-18 14:00:00  2024        1            3             321
1002      202          2024-02-20 09:00:00  2024-02-22 11:30:00  2024        2            2             285
1003      201          2024-03-05 16:45:00  NULL                 2024        3            NULL          271

Tips and gotchas

Argument order differs between engines. This is the most common source of date calculation bugs:
  • Athena: date_diff('day', start, end) — unit first, then start, then end
  • BigQuery: DATE_DIFF(end, start, DAY) — end first, then start, then unit
  • PySpark: F.datediff(end, start) — always returns days, end first
Getting the order wrong produces negative numbers instead of positive ones.
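The sign flip is easy to demonstrate with plain Python dates, which follow the same end-minus-start convention as BigQuery and PySpark:

```python
from datetime import date

order_date = date(2024, 1, 15)
shipped_date = date(2024, 1, 18)

# Correct: end minus start gives a positive day count
print((shipped_date - order_date).days)  # 3

# Swapped arguments flip the sign
print((order_date - shipped_date).days)  # -3
```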
When shipped_date is NULL (e.g., order 1003 hasn’t shipped yet), any date difference involving that column returns NULL. This is the correct behavior; see the Handle NULL values recipe if you want to replace the NULL with a default.
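If you do want a default instead of NULL, one minimal sketch (BigQuery syntax) wraps the calculation in COALESCE; the -1 sentinel here is an arbitrary choice for illustration:

```sql
SELECT
  order_id,
  -- Orders that have not shipped get -1 instead of NULL
  COALESCE(
    DATE_DIFF(CAST(shipped_date AS DATE), CAST(order_date AS DATE), DAY),
    -1
  ) AS days_to_ship
FROM `raw.orders`
```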
Timestamps include both date and time. When calculating day differences, most engines ignore the time component and count calendar days. If you need hour-level or minute-level precision, use the appropriate unit ('hour', 'minute') in SQL or F.unix_timestamp differences in PySpark.
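The distinction matters for order 1001 above: it shipped 3 calendar days after ordering, but slightly more than 75 hours elapsed. A plain-Python illustration of the two ways of counting:

```python
from datetime import datetime

order_ts = datetime(2024, 1, 15, 10, 30)
shipped_ts = datetime(2024, 1, 18, 14, 0)

# Calendar days: compare the date parts only; time of day is ignored
print((shipped_ts.date() - order_ts.date()).days)  # 3

# Elapsed hours: the time component now matters
print((shipped_ts - order_ts).total_seconds() / 3600)  # 75.5
```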