Skip to article frontmatterSkip to article content

SQL Joins

Just like in pandas, you can use SQL to perform joins on your data. Common join types include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.

A join combines data from two or more tables based on a related column between them.

Instead of storing all data in one huge table, databases normalize data into smaller, related tables - and joins allow us to query them together.


🎯 Why Use Joins?

Without joins, you would have to:

  • Query each table separately
  • Manually match related rows in your application code

With joins, you can:

  • Bring together related data in a single SQL query
  • Analyze relationships between entities (e.g., customers → orders)
  • Avoid redundant data storage (follow normalization best practices)

⚙️ How Joins Work

At a high level:

  1. Each table has a key column (like customer_id).
  2. SQL compares these key columns across tables.
  3. Rows with matching key values are combined into one result.
SELECT *
FROM customers
JOIN orders
  ON customers.customer_id = orders.customer_id;

▶️ Import pandas, numpy, and sqlite3.

import pandas as pd
import numpy as np
import sqlite3

🔗 Types of Joins

SQL supports several types of joins, each controlling which rows are included.

🤝 INNER JOIN

An inner join returns rows that have matching values in both tables.

SELECT *
FROM customers AS c
INNER JOIN orders AS o
ON c.customer_id = o.customer_id;

Result: Only customers who have placed at least one order.

⬅️ LEFT JOIN (or LEFT OUTER JOIN)

A left join returns all rows from the left table, and matching rows from the right table.

If there’s no match, you’ll still get the left table’s data, with NULLs on the right side.

SELECT *
FROM customers AS c
LEFT JOIN orders AS o
  ON c.customer_id = o.customer_id;

Result: All customers, even those who never placed an order.

➡️ RIGHT JOIN (or RIGHT OUTER JOIN)

A right join returns all rows from the right table and matching rows from the left table.

A right join is like a left join, but reversed.

SELECT *
FROM customers AS c
RIGHT JOIN orders AS o
  ON c.customer_id = o.customer_id;

Result: All orders, even those placed by customers not in the customers table. This is less common since usually the “left” table is the primary entity.

↔️ FULL JOIN (or FULL OUTER JOIN)

A full join returns all rows when there’s a match in either table.

SELECT *
FROM customers AS c
FULL OUTER JOIN orders AS o
  ON c.customer_id = o.customer_id;

Result: All customers and all orders -- matching or not.

🔀 CROSS JOIN

A cross join produces the Cartesian product - every row in the first table paired with every row in the second.

SELECT *
FROM customers
CROSS JOIN stores;

Result: If there are 10 customers and 3 stores → 30 combinations.

Use Case: Generate all possible pairings (e.g., “every store--customer combo”).

🔁 SELF JOIN

A self join is where a table joins to itself - used for hierarchical or comparative data.

SELECT e1.name AS employee, e2.name AS manager FROM employees AS e1 JOIN employees AS e2 ON e1.manager_id = e2.id;

Use Case: Find each employee’s manager in the same table.

📌 Summary Table

Join TypeUnmatched Rows IncludedTypical Use
INNER JOINOnly matches
LEFT JOIN✅ (left only)Keep all from left
RIGHT JOIN✅ (right only)Keep all from right
FULL JOIN✅ (both sides)Combine all
CROSS JOIN✅ (all combinations)Generate combinations
SELF JOINHierarchical data

🆚 Join Examples using Game App Purchases Data

▶️ Create an in-memory SQLite database and three tables: users, sessions, and purchases.

# Connect to an in-memory SQLite database
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# --- Create tables ---
cur.execute(
    """
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        username TEXT,
        country TEXT,
        registration_date TEXT
    )"""
)
cur.execute(
    """
    CREATE TABLE sessions (
        session_id INTEGER PRIMARY KEY,
        user_id INTEGER,
        session_start TEXT,
        duration_minutes INTEGER,
        FOREIGN KEY (user_id) REFERENCES users(user_id)
    )"""
)
cur.execute(
    """
    CREATE TABLE purchases (
        purchase_id INTEGER PRIMARY KEY,
        session_id INTEGER,
        item_name TEXT,
        price_usd REAL,
        FOREIGN KEY (session_id) REFERENCES sessions(session_id)
    )"""
)

conn.commit()

▶️ Populate the tables with sample data.

Note the following about the sample data:

  • “gamer_Z” and “newbie” have no sessions.
  • Session 112 for user 8 has no purchases
# Users
# "gamer_Z" and "newbie" have no sessions.
users_data = [
    (1, "alpha_player", "USA", "2025-01-15"),
    (2, "beta_tester", "CAN", "2025-01-20"),
    (3, "charlie_cat", "USA", "2025-02-11"),
    (4, "delta_dawn", "GBR", "2025-03-05"),
    (5, "echo_storm", "USA", "2025-04-22"),
    (6, "foxtrot_fan", "CAN", "2025-05-18"),
    (7, "gamma_ray", "DEU", "2025-06-01"),
    (8, "hotel_hero", "USA", "2025-07-14"),
    (9, "gamer_Z", "USA", "2025-08-01"),
    (10, "newbie", "GBR", "2025-09-10"),
]
cur.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", users_data)

# Sessions
# Session 112 for user 8 has no purchases
sessions_data = [
    (101, 1, "2025-10-01 08:00", 45),
    (102, 2, "2025-10-01 09:15", 30),
    (103, 1, "2025-10-01 11:30", 65),
    (104, 3, "2025-10-02 14:00", 25),
    (105, 4, "2025-10-02 15:10", 90),
    (106, 5, "2025-10-03 18:45", 50),
    (107, 6, "2025-10-03 20:00", 120),
    (108, 2, "2025-10-04 09:00", 40),
    (109, 7, "2025-10-04 13:20", 85),
    (110, 3, "2025-10-05 16:50", 15),
    (111, 4, "2025-10-05 19:00", 75),
    (112, 8, "2025-10-06 10:00", 35),
]
cur.executemany("INSERT INTO sessions VALUES (?, ?, ?, ?)", sessions_data)

# Purchases
purchases_data = [
    (1001, 101, "Gold Coins", 4.99),
    (1002, 102, "Power-Up", 1.99),
    (1003, 103, "Special Skin", 9.99),
    (1004, 103, "Gold Coins", 4.99),
    (1005, 105, "Expansion Pack", 14.99),
    (1006, 105, "Power-Up", 1.99),
    (1007, 107, "Gold Coins", 4.99),
    (1008, 109, "Special Skin", 9.99),
    (1009, 110, "Power-Up", 1.99),
    (1010, 111, "Gold Coins", 4.99),
]
cur.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", purchases_data)
conn.commit()

🛒 Finding All Successful Purchases

An INNER JOIN finds the intersection of data across all tables. It only returns rows that have a complete, matching record from the Users table all the way to the Purchases table.

Business Question: “Show me the usernames, countries, and items for all successful purchases.”

Logic: To answer this, we must link Users to Sessions (to find the user for a session) and then link Sessions to Purchases (to find the purchase in that session).

query = """
SELECT
    u.username,
    u.country,
    p.item_name,
    p.price_usd
FROM
    users u
INNER JOIN sessions s ON u.user_id = s.user_id
INNER JOIN purchases p ON s.session_id = p.session_id
ORDER BY
    u.username;
"""

pd.read_sql_query(query, con=conn)
Loading...

Only users who had a session that resulted in a purchase are shown. Users like ‘hotel_hero’ (no purchase) and ‘gamer_Z’ (no sessions) are excluded.


📊 USA Users and Their Session Durations

A LEFT JOIN is perfect for finding gaps in your data. It will return all rows from the “left” table, regardless of whether a match is found in the tables you join to it.

Business Question: “List all registered users from the USA and, if they had any sessions, show their session duration. We need to see all USA users, even if they’ve never played.”

Logic: We start with Users as our left table to ensure every user is included. We then filter for country = 'USA'.

query = """
SELECT
    u.username,
    u.registration_date,
    s.duration_minutes
FROM
    users u
LEFT JOIN sessions s ON u.user_id = s.user_id
WHERE
    u.country = 'USA';
"""

pd.read_sql_query(query, con=conn)
Loading...

💳 Calculating Total Spending Per User

This is a classic analytics task that combines joins with aggregation to summarize data.

Business Question: “How much has each paying user spent in total? Show me the top spenders first.”

Logic: We need to connect users to their purchases. To do this, we INNER JOIN all three tables to link usernames to the prices of items they bought. Then, we group the results by username and use the SUM() function to calculate the total price.

query = """
SELECT
    u.username,
    u.country,
    SUM(p.price_usd) AS total_spent
FROM
    Users u
INNER JOIN Sessions s ON u.user_id = s.user_id
INNER JOIN Purchases p ON s.session_id = p.session_id
GROUP BY
    u.username, u.country
ORDER BY
    total_spent DESC;
"""

pd.read_sql_query(query, conn)
Loading...

👻 Finding Inactive Users

LEFT JOIN is the perfect tool for finding what’s missing. Here, we can find users who signed up but never engaged with the app.

Business Question: “Which users have registered but have never started a session?”

Logic: We perform a LEFT JOIN from Users to Sessions. This keeps every user in the result. For users who have no matching sessions, the columns from the Sessions table (like session_id) will be NULL. We can then filter for these NULL values.

query = """
SELECT
    u.user_id,
    u.username,
    u.registration_date
FROM
    Users u
LEFT JOIN Sessions s ON u.user_id = s.user_id
WHERE
    s.session_id IS NULL;
"""

pd.read_sql_query(query, conn)
Loading...

📉 Finding Sessions Without Purchases

This is similar to the previous example but helps identify user sessions that didn’t convert to a sale.

Business Question: “Which sessions ended without a purchase? Show me the user and the session duration.”

Logic: We start with the Sessions table and LEFT JOIN it to Purchases. Any session that has no matching purchase will have NULL in the purchase_id column. We also join back to Users to get the username.

query = """
SELECT
    u.username,
    s.session_id,
    s.duration_minutes
FROM
    Sessions s
LEFT JOIN Purchases p ON s.session_id = p.session_id
INNER JOIN Users u ON s.user_id = u.user_id
WHERE
    p.purchase_id IS NULL;
"""

pd.read_sql_query(query, conn)
Loading...

🧪 Complex Query with HAVING Clause

Now let’s combine everything to answer a highly specific and valuable business question.

Business Question: “From our Canadian users, who are the customers that have spent more than $3.00 in total?”

Logic: This requires multiple steps:

  1. INNER JOIN all tables to link users to their spending.
  2. WHERE clause to filter for users from Canada (CAN) only.
  3. GROUP BY username to aggregate their purchases.
  4. HAVING clause to filter these groups to include only those whose total spending (SUM(price_usd)) is greater than $3.00.
query = """
SELECT
    u.username,
    SUM(p.price_usd) AS total_spent_in_can
FROM
    Users u
INNER JOIN Sessions s ON u.user_id = s.user_id
INNER JOIN Purchases p ON s.session_id = p.session_id
WHERE
    u.country = 'CAN'
GROUP BY
    u.username
HAVING
    SUM(p.price_usd) > 3.00;
"""

pd.read_sql_query(query, conn)
Loading...