When to use this
Data from a single source is often split across multiple tables — a CRM has deals and companies, an e-commerce platform has orders and customers. Joining these tables lets you enrich one entity with data from another (e.g., adding the company name to each deal, or the customer email to each order).
Two tables in the Raw layer:
orders
| order_id | customer_id | amount | order_date |
|---|
| 1001 | 201 | 249.99 | 2024-03-15 |
| 1002 | 202 | 89.50 | 2024-03-16 |
| 1003 | 999 | 150.00 | 2024-03-17 |
customers
| customer_id | name | email | country |
|---|
| 201 | Alice Johnson | alice@acme.com | BR |
| 202 | Bob Smith | bob@globex.com | US |
| 203 | Carol Lee | carol@initech.com | GB |
Order 1003 references customer_id = 999, which doesn’t exist in the customers table. Customer 203 has no orders. This makes it a good scenario to compare join types.
Implementation
Nekt Express / BigQuery
Athena SQL
Python (Nekt SDK)
The same LEFT JOIN syntax applies in BigQuery.SELECT
o.order_id,
o.amount,
o.order_date,
c.name AS customer_name,
c.email AS customer_email,
c.country AS customer_country
FROM `raw.orders` AS o
LEFT JOIN `raw.customers` AS c
ON o.customer_id = c.customer_id
BigQuery also supports USING when the join column has the same name in both tables:FROM `raw.orders` AS o
LEFT JOIN `raw.customers` AS c USING (customer_id)
Use LEFT JOIN to keep all orders and enrich with customer data when available.SELECT
o.order_id,
o.amount,
o.order_date,
c.name AS customer_name,
c.email AS customer_email,
c.country AS customer_country
FROM raw.orders AS o
LEFT JOIN raw.customers AS c
ON o.customer_id = c.customer_id
Join type cheat sheet:
INNER JOIN — only rows that match in both tables (would exclude order 1003 and customer 203)
LEFT JOIN — all rows from the left table, with NULLs for unmatched right rows (keeps order 1003, excludes customer 203)
FULL OUTER JOIN — all rows from both tables (keeps everything, with NULLs on both sides for unmatched rows)
In PySpark, use the join method specifying the join column and type.import nekt
orders_df = nekt.load_table(layer_name="Raw", table_name="orders")
customers_df = nekt.load_table(layer_name="Raw", table_name="customers")
enriched_df = orders_df.join(
customers_df,
on="customer_id",
how="left"
).select(
"order_id",
"amount",
"order_date",
customers_df["name"].alias("customer_name"),
customers_df["email"].alias("customer_email"),
customers_df["country"].alias("customer_country")
)
nekt.save_table(
df=enriched_df,
layer_name="Trusted",
table_name="orders_enriched"
)
When join columns have different names, use a condition instead of a column name:orders_df.join(
customers_df,
orders_df["cust_id"] == customers_df["customer_id"],
how="left"
)
Change how to "inner", "left", "right", or "full" depending on your needs.
Expected output (LEFT JOIN)
| order_id | amount | order_date | customer_name | customer_email | customer_country |
|---|
| 1001 | 249.99 | 2024-03-15 | Alice Johnson | alice@acme.com | BR |
| 1002 | 89.50 | 2024-03-16 | Bob Smith | bob@globex.com | US |
| 1003 | 150.00 | 2024-03-17 | NULL | NULL | NULL |
Order 1003 is kept with NULL customer fields. Customer 203 (Carol) doesn’t appear because she has no matching orders.
Tips and gotchas
Joins can multiply rows if the relationship is not one-to-one. For example, if a customer has 3 orders and you join customers to orders, you’ll get 3 rows for that customer. Always check the join cardinality and verify the row count after joining.
When joining large tables, performance matters. Both Athena and BigQuery optimize joins automatically, but you can help by:
- Filtering each table before joining (use CTEs or subqueries to reduce row counts)
- Joining on indexed or partitioned columns when possible
- Avoiding
SELECT * — only select the columns you need