Skip to main content

When to use this

Data from a single source is often split across multiple tables — a CRM has deals and companies, an e-commerce platform has orders and customers. Joining these tables lets you enrich one entity with data from another (e.g., adding the company name to each deal, or the customer email to each order).

Sample input

Two tables in the Raw layer: orders
order_idcustomer_idamountorder_date
1001201249.992024-03-15
100220289.502024-03-16
1003999150.002024-03-17
customers
customer_idnameemailcountry
201Alice Johnsonalice@acme.comBR
202Bob Smithbob@globex.comUS
203Carol Leecarol@initech.comGB
Order 1003 references customer_id = 999, which doesn’t exist in the customers table. Customer 203 has no orders. This makes it a good scenario to compare join types.

Implementation

The same LEFT JOIN syntax applies in BigQuery.
SELECT
  o.order_id,
  o.amount,
  o.order_date,
  c.name          AS customer_name,
  c.email         AS customer_email,
  c.country       AS customer_country
FROM `raw.orders` AS o
LEFT JOIN `raw.customers` AS c
  ON o.customer_id = c.customer_id
BigQuery also supports USING when the join column has the same name in both tables:
FROM `raw.orders` AS o
LEFT JOIN `raw.customers` AS c USING (customer_id)

Expected output (LEFT JOIN)

order_idamountorder_datecustomer_namecustomer_emailcustomer_country
1001249.992024-03-15Alice Johnsonalice@acme.comBR
100289.502024-03-16Bob Smithbob@globex.comUS
1003150.002024-03-17NULLNULLNULL
Order 1003 is kept with NULL customer fields. Customer 203 (Carol) doesn’t appear because she has no matching orders.

Tips and gotchas

Joins can multiply rows if the relationship is not one-to-one. For example, if a customer has 3 orders and you join customers to orders, you’ll get 3 rows for that customer. Always check the join cardinality and verify the row count after joining.
When joining large tables, performance matters. Both Athena and BigQuery optimize joins automatically, but you can help by:
  • Filtering each table before joining (use CTEs or subqueries to reduce row counts)
  • Joining on indexed or partitioned columns when possible
  • Avoiding SELECT * — only select the columns you need