SQL Aggregations

Just like in pandas, you can use SQL to perform aggregations on your data. Common aggregation functions include COUNT(), SUM(), AVG(), MIN(), and MAX().

The syntax for using these functions is straightforward. Here are some examples:

-- Count the number of listings
SELECT COUNT(*) AS total_listings
FROM listings;

-- Calculate the average price of listings
SELECT AVG(price) AS average_price
FROM listings;

▶️ Import pandas, numpy, and sqlite3.

import pandas as pd
import numpy as np
import sqlite3

🆚 Comparing Pandas and SQL Aggregations using Airbnb Listings¶

In this section, we will compare how to perform similar tasks using both pandas and SQL. We will use a dataset of Airbnb listings for this comparison.

▶️ Create a DataFrame named df_listings.

The DataFrame contains Boston Airbnb listings data.

df_listings = pd.read_csv(
    "https://github.com/bdi475/datasets/raw/main/boston-airbnb-listings-small.csv"
)
df_listings_backup = df_listings.copy()

display(df_listings.head(5))

▶️ Populate the listings table. This will insert the data from the df_listings DataFrame into the listings table in the SQLite database.

conn = sqlite3.connect("airbnb-boston-medium.db")
c = conn.cursor()

tables = list(
    pd.read_sql_query("SELECT * FROM sqlite_master WHERE type='table';", con=conn)[
        "tbl_name"
    ]
)

if "listings" in tables:
    c.execute(f"DELETE FROM listings")
    conn.commit()

df_listings.to_sql(name="listings", index=False, con=conn, if_exists="append")

1339

▶️ Display the first 5 rows of the df_listings DataFrame.

display(df_listings.head())

📌 Example 1: Select all rows and columns¶

▶️ Use pandas.

display(df_listings)

▶️ Use SQLite.

query_select_all = """
SELECT *
FROM listings;
"""

df_result = pd.read_sql_query(query_select_all, con=conn)
display(df_result)

What does the semicolon (;) do in SQL?

The semicolon (;) is used to terminate a SQL statement. This is especially important when executing multiple SQL statements in a single execution block, as it helps the SQL interpreter understand where one statement ends and the next begins.

If you’re only running a single SQL statement, the semicolon is optional in many SQL environments. But it is a good practice to include it for clarity and to avoid syntax errors in some systems.

-- Because this is a single statement, the semicolon is optional here
SELECT * FROM listings

-- But if you have multiple statements, you need to use semicolons to separate them
INSERT INTO my_table (column1, column2) VALUES ('value1', 'value2');
UPDATE my_table SET column1 = 'new_value' WHERE column2 = 'value2';
DELETE FROM my_table WHERE column2 = 'value2';

📌 Example 2: Find expensive listings¶

Find listings with a price greater than $1,000.

▶️ Use pandas.

df_listings[df_listings["price"] > 1000]

▶️ Use SQLite.

query_listings_above_1000 = """
SELECT *
FROM listings
WHERE price > 1000
"""

df_result = pd.read_sql_query(query_listings_above_1000, con=conn)
display(df_result)

📌 Example 3: Find large listings¶

Filter listings with 5 or more bedrooms AND 3 or more bathrooms. Select only the "name", "bedrooms", "bathrooms", "price", "review_score" columns.

▶️ Use pandas.

df_large_listings = df_listings[
    (df_listings["bedrooms"] >= 5) & (df_listings["bathrooms"] >= 3)
]
df_large_listings = df_large_listings[
    ["name", "bedrooms", "bathrooms", "price", "review_score"]
]

display(df_large_listings)

▶️ Use SQLite.

query_large_listings = """
SELECT name, bedrooms, bathrooms, price, review_score

FROM listings
WHERE (bedrooms >= 5) AND (bathrooms >= 3)
"""

df_result = pd.read_sql_query(query_large_listings, con=conn)
display(df_result)

📌 Example 4: Find the number of listings¶

▶️ Use pandas.

num_listings = df_listings.shape[0]

print(f"There are {num_listings} Airbnb listings in the dataset.")

There are 1339 Airbnb listings in the dataset.

▶️ Use SQLite.

query_num_listings = """
SELECT COUNT(*)
FROM listings
"""

df_result = pd.read_sql_query(query_num_listings, con=conn)
display(df_result)

What does the asterisk (*) mean in SQL?

The asterisk (_) in SQL is a wildcard character that represents all columns in a table. When you use SELECT _, it means you want to retrieve all columns from the specified table. For example, SELECT _ FROM listings;will return all columns for every row in thelistingstable. When you use it with aggregation functions likeCOUNT(_), it counts all rows in the table.

If you specify a column name instead of an asterisk, such as SELECT COUNT(price), it will count only the rows where the price column is not NULL.

Alternatively, you can use SELECT COUNT(1) which counts all rows in the table, similar to COUNT(*). The choice between COUNT(*) and COUNT(1) is often a matter of preference, as they generally yield the same result. Historically, older database systems had performance differences between the two, but in modern databases, they are optimized to perform similarly.

📌 Example 5: Find the number of listings by neighbourhood¶

Find the number of listings in each neighbourhood. Sort the results by the number of listings in descending order. Display only the top 5 neighbourhoods.

▶️ Use pandas.

df_by_neighbourhood = df_listings.groupby("neighbourhood", as_index=False).agg(
    {"name": "count"}
)

df_by_neighbourhood.rename(columns={"name": "num_listings"}, inplace=True)
df_by_neighbourhood.sort_values("num_listings", ascending=False, inplace=True)

df_by_neighbourhood.head(5)

▶️ Use SQLite.

query_num_listings_by_neighbourhood = """
SELECT neighbourhood, COUNT(*) AS num_listings
FROM listings
GROUP BY neighbourhood
ORDER BY num_listings DESC
LIMIT 5;
"""

df_result = pd.read_sql_query(query_num_listings_by_neighbourhood, con=conn)
display(df_result)

📌 Example 6: Calculate the average price by room type¶

Calculate the average price for each room type. Sort the results by price in ascending order.

▶️ Use pandas.

df_price_by_room_type = df_listings.groupby("room_type", as_index=False).agg(
    {"price": "mean"}
)

df_price_by_room_type.sort_values("price", inplace=True)

df_price_by_room_type

▶️ Use SQLite.

query_price_by_room_type = """
SELECT room_type, AVG(price) as price
FROM listings
GROUP BY room_type
ORDER BY price
"""

df_result = pd.read_sql_query(query_price_by_room_type, con=conn)
display(df_result)

🎬 More Aggregation Examples¶

In this section, we will work with a Bollywood Movies Dataset that includes all 1698 Hindi language movies released in India between 2005 and 2017 from the website of Box Office India.

Source: Mendeley Data

▶️ Create a DataFrame named df_movies from a CSV file.

df_movies = pd.read_csv(
    "https://github.com/bdi475/datasets/raw/main/bollywood-movies.csv"
)
df_movies_backup = df_movies.copy()

display(df_movies.head(5))

▶️ Run the code below to populate the movies table. This will insert the data from the df_movies DataFrame into the movies table in the SQLite database.

conn = sqlite3.connect("bollywood-movies.db")
c = conn.cursor()

tables = list(
    pd.read_sql_query('SELECT * FROM sqlite_master WHERE type="table";', con=conn)[
        "tbl_name"
    ]
)

if "movies" in tables:
    c.execute(f"DELETE FROM movies")
    conn.commit()

df_movies.to_sql(name="movies", index=False, con=conn, if_exists="append")

1698

▶️ Display all rows and columns of the movies table.

pd.read_sql_query("SELECT * FROM movies", con=conn)

📌 Example 7: Find the top 5 movies with the largest number of screens¶

Find the top 5 movies with the largest number of screens. Only show the movie name and number of screens. Sort the results by number of screens in descending order.

query_top5_movies_by_screens = """
SELECT movie_name, num_screens
FROM movies
ORDER BY num_screens DESC
LIMIT 5;
"""

df_result = pd.read_sql_query(query_top5_movies_by_screens, con=conn)
display(df_result)

📌 Example 8: Find the top 10 lead stars by number of movies¶

Find the number of movies for each lead star. Sort the results by the number of movies in descending order. Only show the top 10 lead stars.

query_num_movies_by_lead_star = """
SELECT lead_star, COUNT(*) AS num_movies
FROM movies
GROUP BY lead_star
ORDER BY num_movies DESC
LIMIT 10;
"""

df_result = pd.read_sql_query(query_num_movies_by_lead_star, con=conn)
display(df_result)

📌 Example 9: Calculate statistics by genre¶

Find the number of movies, average revenue, and average budget for each genre. Sort the results by average revenue in descending order.

▶️ Use SQLite.

query_genre_stats = """
SELECT 
    genre,
    COUNT(*) AS num_movies,
    AVG(revenue_in_INR) AS average_revenue,
    AVG(budget_in_INR) AS average_budget
FROM movies
GROUP BY genre
ORDER BY average_revenue DESC;
"""

# Avoid scientific notations
# This is to improve readability of large numbers in the output
pd.set_option("display.float_format", lambda x: "%.2f" % x)

df_result = pd.read_sql_query(query_genre_stats, con=conn)
display(df_result)

📌 Example 10: Calculate statistics by genre and release period¶

Find the number of movies, average revenue, and average budget for each genre and release period. Sort the results by average revenue in descending order.

Exclude remakes and only include groups with at least 20 movies.

▶️ Use SQLite.

query_genre_release_period_stats = """
SELECT 
    genre,
    release_period,
    COUNT(*) AS num_movies,
    AVG(revenue_in_INR) AS average_revenue,
    AVG(budget_in_INR) AS average_budget
FROM movies
WHERE is_remake = 'No'
GROUP BY genre, release_period
HAVING num_movies >= 20
ORDER BY average_revenue DESC;
"""

df_result = pd.read_sql_query(query_genre_release_period_stats, con=conn)
display(df_result)

In this example, we passed two columns to the GROUP BY clause: genre and release_period. This means that the aggregation functions (COUNT(), AVG()) will be calculated for each unique combination of genre and release_period.

🕵️‍♀️ `WHERE` (row-level filter) vs `HAVING` (group-level filter)¶

The WHERE clause is used to filter rows before any GROUP BY operations, while the HAVING clause is used to filter groups after a GROUP BY operation has been applied.

You cannot use WHERE to filter aggregated results; for that, you must use HAVING. For example, you can only filter with WHERE is_remake = "NO" before the aggregation.

Similarly, you can only filter with HAVING num_movies >= 20 after the aggregation. This is because the num_movies is calculated during the aggregation process and is not available for filtering until after the grouping has been completed.

We can see the difference between WHERE and HAVING clauses by applying them to the same column. In the example below, we use both WHERE and HAVING to filter the results based on the budget_in_INR column.

▶️ First, we use the WHERE clause to filter out movies with a budget greater than ₹2 million before the aggregation.

This means that only movies with a budget greater than ₹2 million will be considered in the aggregation functions (COUNT(), AVG()). The filtering is done at the row level before any grouping occurs.

query_using_where = """
SELECT 
    genre,
    AVG(budget_in_INR) AS avg_budget
FROM movies
WHERE budget_in_INR > 2000000
GROUP BY genre;
"""

df_result = pd.read_sql_query(query_using_where, con=conn)
display(df_result)

▶️ Next, we use the HAVING clause to filter out groups where the average budget is greater than ₹2 million after the aggregation.

This means that all movies are considered in the aggregation functions, but only those groups where the average budget exceeds ₹2 million are included in the final result. The filtering is done at the group level after the aggregation has been completed.

query_using_HAVING = """
SELECT 
    genre,
    AVG(budget_in_INR) AS avg_budget
FROM movies
GROUP BY genre
HAVING AVG(budget_in_INR) > 2000000;
"""

df_result = pd.read_sql_query(query_using_HAVING, con=conn)
display(df_result)

The “mythological” genre is excluded in the first query because none of its movies have a budget greater than ₹2 million. However, it appears in the second query because its average budget across all movies exceeds ₹2 million.

Tools Used in This Course

Basic Queries in SQLite

Tools Used in This Course

SQL Joins

🆚 Comparing Pandas and SQL Aggregations using Airbnb Listings¶

📌 Example 1: Select all rows and columns¶

📌 Example 2: Find expensive listings¶

📌 Example 3: Find large listings¶

📌 Example 4: Find the number of listings¶

📌 Example 5: Find the number of listings by neighbourhood¶

📌 Example 6: Calculate the average price by room type¶

🎬 More Aggregation Examples¶

📌 Example 7: Find the top 5 movies with the largest number of screens¶

📌 Example 8: Find the top 10 lead stars by number of movies¶

📌 Example 9: Calculate statistics by genre¶

📌 Example 10: Calculate statistics by genre and release period¶

🕵️‍♀️ WHERE (row-level filter) vs HAVING (group-level filter)¶

🕵️‍♀️ `WHERE` (row-level filter) vs `HAVING` (group-level filter)¶