Join tables

When to use this

Data from a single source is often split across multiple tables — a CRM has deals and companies, an e-commerce platform has orders and customers. Joining these tables lets you enrich one entity with data from another (e.g., adding the company name to each deal, or the customer email to each order).

Sample input

Two tables in the Raw layer: orders

order_id	customer_id	amount	order_date
1001	201	249.99	2024-03-15
1002	202	89.50	2024-03-16
1003	999	150.00	2024-03-17

customers

customer_id	name	email	country
201	Alice Johnson	alice@acme.com	BR
202	Bob Smith	bob@globex.com	US
203	Carol Lee	carol@initech.com	GB

Order 1003 references customer_id = 999, which doesn’t exist in the customers table. Customer 203 has no orders. This makes it a good scenario to compare join types.

Implementation

Nekt Express / BigQuery
Athena SQL
Python (Nekt SDK)

The same LEFT JOIN syntax applies in BigQuery.

SELECT
  o.order_id,
  o.amount,
  o.order_date,
  c.name          AS customer_name,
  c.email         AS customer_email,
  c.country       AS customer_country
FROM `raw.orders` AS o
LEFT JOIN `raw.customers` AS c
  ON o.customer_id = c.customer_id

BigQuery also supports USING when the join column has the same name in both tables:

FROM `raw.orders` AS o
LEFT JOIN `raw.customers` AS c USING (customer_id)

Use LEFT JOIN to keep all orders and enrich with customer data when available.

SELECT
  o.order_id,
  o.amount,
  o.order_date,
  c.name          AS customer_name,
  c.email         AS customer_email,
  c.country       AS customer_country
FROM raw.orders AS o
LEFT JOIN raw.customers AS c
  ON o.customer_id = c.customer_id

Join type cheat sheet:

INNER JOIN — only rows that match in both tables (would exclude order 1003 and customer 203)
LEFT JOIN — all rows from the left table, with NULLs for unmatched right rows (keeps order 1003, excludes customer 203)
FULL OUTER JOIN — all rows from both tables (keeps everything, with NULLs on both sides for unmatched rows)

In PySpark, use the join method specifying the join column and type.

import nekt

orders_df = nekt.load_table(layer_name="Raw", table_name="orders")
customers_df = nekt.load_table(layer_name="Raw", table_name="customers")

enriched_df = orders_df.join(
    customers_df,
    on="customer_id",
    how="left"
).select(
    "order_id",
    "amount",
    "order_date",
    customers_df["name"].alias("customer_name"),
    customers_df["email"].alias("customer_email"),
    customers_df["country"].alias("customer_country")
)

nekt.save_table(
    df=enriched_df,
    layer_name="Trusted",
    table_name="orders_enriched"
)

When join columns have different names, use a condition instead of a column name:

orders_df.join(
    customers_df,
    orders_df["cust_id"] == customers_df["customer_id"],
    how="left"
)

Change how to "inner", "left", "right", or "full" depending on your needs.

Expected output (LEFT JOIN)

order_id	amount	order_date	customer_name	customer_email	customer_country
1001	249.99	2024-03-15	Alice Johnson	alice@acme.com	BR
1002	89.50	2024-03-16	Bob Smith	bob@globex.com	US
1003	150.00	2024-03-17	NULL	NULL	NULL

Order 1003 is kept with NULL customer fields. Customer 203 (Carol) doesn’t appear because she has no matching orders.

Tips and gotchas

Joins can multiply rows if the relationship is not one-to-one. For example, if a customer has 3 orders and you join customers to orders, you’ll get 3 rows for that customer. Always check the join cardinality and verify the row count after joining.

When joining large tables, performance matters. Both Athena and BigQuery optimize joins automatically, but you can help by:

Filtering each table before joining (use CTEs or subqueries to reduce row counts)
Joining on indexed or partitioned columns when possible
Avoiding SELECT * — only select the columns you need

​When to use this

​Sample input

​Implementation

​Expected output (LEFT JOIN)

​Tips and gotchas

When to use this

Sample input

Implementation

Expected output (LEFT JOIN)

Tips and gotchas