Skip to article frontmatterSkip to article content

▶️ First, run the code cell below to import unittest, a module used for 🧭 Check Your Work sections and the autograder.

# DO NOT MODIFY THE CODE IN THIS CELL
import unittest

tc = unittest.TestCase()

💎 Case Overview

There are four Transportation Network Providers (often called rideshare companies 🚗) licensed to operate in Chicago. These rideshare companies are required to routinely report vehicle, driver, and trip information to the City of Chicago, which is published on the Chicago Data Portal. The latest vehicle dataset can be downloaded at registered vehicles dataset. The original dataset has been preprocessed to fit this case study.

The reporting is done on a monthly basis, as indicated by the REPORTED_YEAR and REPORTED_MONTH columns. For each registered vehicle at a given month, the following information is provided:

Column NameDescription
REPORTD_YEARThe year in which the vehicle was reported
REPORTED_MONTHThe month in which the vehicle was reported
STATEThe state of the license plate
MAKEThe make of the vehicle
MODELThe model of the vehicle
COLORThe color of the vehicle
MODEL_YEARThe model year of the vehicle
NUMBER_OF_TRIPSNumber of trips provided in this month. Due to the complexities of matching, errors are possible in both directions. Values over 999 are converted to null as suspected error values that interfere with easy data visualization
MULTIPLE_TNPSWhether the vehicle was reported by multiple TNPs in this month. Matching is imperfect so some vehicle records that should have been combined may be separate.

⚔️ Your Goal

In this case study, you’ll use Pandas to explore and analyze over 1 million monthly rideshare vehicle registrations. Below are some of the questions you’ll be answering.

  • How many total trips did drivers make between 2015 and 2024?

  • How many trips did drivers make per month on average between 2015 and 2024?

  • Do Lincoln drivers make more trips compared to the overall average?

  • Do any drivers drive a red Tesla?

  • Do any drivers drive a Porsche or a Hummer?

  • Are there any vehicles that are brand new?

  • Do drivers who drive for two or more rideshare companies (e.g., Uber and Lyft) make more trips on average compared to other drivers who drive for only one rideshare company?

  • Which automakers are popular amongst rideshare drivers?

  • Which automakers have the highest number of trips on average?

  • How many trips do drivers of the Japanese Big 3 automakers make on average?

  • Which make/model is the most popular amongst rideshare drivers?

  • How about car colors?

Notes

  • Vehicle registrations that made less than 100 monthly trips are excluded. If you’d like to use the full dataset, you can download the full data (~500 MB) at registered vehicles dataset.

  • A vehicle will likely be registered multiple times. For example, if a driver drives a Camry for 12 months, there will be 12 rows of duplicate vehicle information.


🎯 Part 1: Import numpy and pandas

👇 Tasks

  • ✔️ Import the following Python packages.

    1. pandas: Use alias pd.

    2. numpy: Use alias np.

# YOUR CODE BEGINS

# YOUR CODE ENDS

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-01"
_points = 2

tc.assertTrue(
    "pd" in globals(), "Check whether you have correctly import Pandas with an alias."
)
tc.assertTrue(
    "np" in globals(), "Check whether you have correctly import NumPy with an alias."
)

🔭 Explore Data

📌 Load Data

▶️ Run the code cell below to configure Pandas, create a new DataFrame named df_v from a ZIP file, and clone the dataset for intermediate checks.

# DO NOT CHANGE THE CODE IN THIS CELL
# Display precision of decimal places to 1
# Precision of decimal places does not have a significance in this case study
pd.set_option("display.precision", 1)

df_v = pd.read_csv(
    "https://github.com/bdi475/datasets/blob/main/chicago-ridesharing-vehicles-202409.zip?raw=true",
    compression="zip",
)

# Used to keep a clean copy
df_v_backup = df_v.copy()

# Display the first 5 rows
df_v.head()

🎯 Part2: Find the Number of Rows and Columns in the Dataset

👇 Tasks

  • ✔️ Store the number of rows in df_v to a new variable named num_rows.

  • ✔️ Store the number of columns in df_v to a new variable named num_cols.

  • ✔️ Both num_rows and num_cols must be ints.

  • ✔️ Use .shape, not len().

# YOUR CODE BEGINS

# YOUR CODE ENDS

print(f"There are {num_rows} rows and {num_cols} columns in the dataset.")
print(
    "🐼 With over two million rows, this dataset is by far the largest one we've worked with so far!!"
)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-02"
_points = 2

tc.assertEqual(
    num_rows,
    len(df_v_backup.index),
    f"Number of rows should be {len(df_v_backup.index)}",
)
tc.assertEqual(
    num_cols,
    len(df_v_backup.columns),
    f"Number of columns should be {len(df_v_backup.columns)}",
)

🎯 Part 3: Concise Summary of the Dataset

👇 Tasks

  • ✔️ Use the info() method to print out a concise summary of df_v.

🚀 Hints

  • my_dataframe.info() prints out a concise summary of my_dataframe.

# YOUR CODE BEGINS

# YOUR CODE ENDS
# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-03"
_points = 2

🎯 Part 4: Total Number of Trips

👇 Tasks

  • ✔️ Calculate the total number of trips (monthly) made by all registered vehicles in df_v.

  • ✔️ Store the result in a new variable named total_num_trips.

  • ✔️ total_num_trips should be an int type (e.g., int, np.int64).

🚀 Hints

  • Sum all values in the NUMBER_OF_TRIPS column.

# YOUR CODE BEGINS

# YOUR CODE ENDS

print(
    f'{round(total_num_trips / (10 ** 6), 2)} million ridesharing trips have taken place in Chicago since {df_v["REPORTED_YEAR"].min()}.'
)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-04"
_points = 3

tc.assertEqual(
    total_num_trips, np.sum(df_v_backup["_".join(["nUmbEr", "oF", "TrIpS"]).upper()])
)

🎯 Part 5: Average Number of Trips (Monthly)

👇 Tasks

  • ✔️ Calculate the average number of trips (monthly) made by all registered vehicles in df_v.

  • ✔️ Store the result in a new variable named avg_num_trips.

  • ✔️ avg_num_trips should be a float type (e.g., float, np.float64).

🚀 Hints

  • Find the mean of all values in the NUMBER_OF_TRIPS column.

# YOUR CODE BEGINS

# YOUR CODE ENDS

print(
    f"On average, a registered vehicle made {round(avg_num_trips, 1)} trips each month."
)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-05"
_points = 3

tc.assertAlmostEqual(
    avg_num_trips, np.mean(df_v_backup["_".join(["nUmbEr", "oF", "TrIpS"]).upper()])
)

🎯 Part 6: Filter Lincoln Vehicles

👇 Tasks

  • ✔️ Using df_v, filter rows where the vehicle’s MAKE is "Lincoln".

    • Store the result in a new variable named df_lincoln.

  • ✔️ df_v should remain unaltered after your code.

# YOUR CODE BEGINS

# YOUR CODE ENDS

display(df_lincoln.head(5))
print(f"There are {df_lincoln.shape[0]} monthly Lincoln vehicle registrations.")

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-06"
_points = 2

tc.assertEqual(df_v.shape, df_v_backup.shape, "df_v should remain unaltered.")

pd.testing.assert_frame_equal(
    df_lincoln.sort_values(df_v_backup.columns.tolist())
    .head(100)
    .reset_index(drop=True),
    df_v_backup.query(f'{"".join(["ma", "ke"]).upper()} == "{"lIncOlN".capitalize()}"')
    .sort_values(df_v_backup.columns.tolist())
    .head(100)
    .reset_index(drop=True),
    check_like=True,
)

🎯 Part 7: Average Number of Trips (Monthly) for Lincoln Vehicles

👇 Tasks

  • ✔️ Using df_lincoln, calculate the average number of trips (monthly) for all Lincoln vehicles.

  • ✔️ Store the result in a new variable named avg_num_trips_lincoln.

  • ✔️ avg_num_trips_lincoln should be a float type (e.g., float, np.float64).

🚀 Hints

  • Find the mean of all values in df_lincoln’s NUMBER_OF_TRIPS column.

# YOUR CODE BEGINS

# YOUR CODE ENDS

print(
    f"Lincoln drivers made {round(avg_num_trips_lincoln, 1)} monthly trips on average."
)
print(
    f"This is higher than the overall average of {round(avg_num_trips, 1)} trips (monthly)."
)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-07"
_points = 3

tc.assertAlmostEqual(
    avg_num_trips_lincoln,
    df_v_backup.query(f'{"".join(["mA", "Ke"]).upper()} == "{"LiNCoLN".capitalize()}"')[
        "_".join(["NuMBeR", "oF", "tRiPs"]).upper()
    ].mean(),
)

🎯 Part 8: Red Tesla Vehicles

👇 Tasks

  • ✔️ Using df_v, filter rows where:

    • MAKE is "Tesla", AND

    • COLOR is "Red".

  • ✔️ Store the result in a new variable named df_red_tesla.

  • ✔️ df_v should remain unaltered after your code.

🚀 Hints

  • Use the logical AND operator (&) to check whether the two conditions are both satisfied.

# YOUR CODE BEGINS

# YOUR CODE ENDS

print(f"There are {df_red_tesla.shape[0]} monthly red Tesla vehicle registrations.")
df_red_tesla.head(5)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-08"
_points = 4

tc.assertEqual(df_v.shape, df_v_backup.shape, "df_v should remain unaltered.")

pd.testing.assert_frame_equal(
    df_v_backup.query(
        f'{"".join(["mA", "kE"]).upper()} == "{"".join(["T", "e", "s", "l", "a"])}" \
                                          & {"".join(["c", "OLO", "r"]).upper()} == "{("r" + "E" + "D").capitalize()}"'
    )
    .sort_values(df_v_backup.columns.tolist())
    .reset_index(drop=True),
    df_red_tesla.sort_values(df_red_tesla.columns.tolist()).reset_index(drop=True),
)

🎯 Part 9: Porsche + Hummer

👇 Tasks

  • ✔️ Using df_v, filter rows where:

    • MAKE is "Porsche", OR

    • MAKE is "Hummer".

  • ✔️ Store the result in a new variable named df_porsche_hummer.

  • ✔️ df_v should remain unaltered after your code.

🚀 Hints

  • Use the logical OR operator (|) to check whether at least one of the two conditions is met.

# YOUR CODE BEGINS

# YOUR CODE ENDS

display(df_porsche_hummer.head(5))
print(
    f"There are {df_porsche_hummer.shape[0]} Porsche and Hummer monthly vehicle registrations."
)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-09"
_points = 4

tc.assertEqual(df_v.shape, df_v_backup.shape, "df_v should remain unaltered.")

pd.testing.assert_frame_equal(
    df_v_backup.query(
        f'{"mAKe".upper()} == "{"PORsCHE".capitalize()}" \
                                      | {"mAke".upper()} == "{"HuMMeR".capitalize()}"'
    )
    .sort_values(df_v_backup.columns.tolist())
    .reset_index(drop=True),
    df_porsche_hummer.sort_values(df_v_backup.columns.tolist()).reset_index(drop=True),
)

🎯 Part 10: Maximum Number of Trips

👇 Tasks

  • ✔️ Find the maximum number of monthly trips for a given vehicle in df_v.

  • ✔️ Store the value in a new variable named max_num_trips.

🚀 Hints

  • Find the maximum value of the "NUMBER_OF_TRIPS" column.

# YOUR CODE BEGINS

# YOUR CODE ENDS

print(f"The maximum reported number of monthly trips is {max_num_trips}.")

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-10"
_points = 3

tc.assertEqual(max_num_trips, 999)

🎯 Part 11: Black Lincoln Mkz with 999+ Trips

👇 Tasks

  • ✔️ Using df_v, filter rows where:

    • MODEL is "Mkz", AND

    • COLOR is "Black", AND

    • NUMBER_OF_TRIPS is greater than or equal to 999.

  • ✔️ Store the result in a new variable named df_black_mkz.

  • ✔️ df_v should remain unaltered after your code.

🚀 Hints

  • Use the logical AND operator (&) with parentheses to check whether two or more conditions are satisfied.

  • Example: df_filtered = df[(condition1) & (condition2) & (condition3)]

🔑 Expected Output

Your index column may contain different values.

REPORTED_YEARREPORTED_MONTHSTATEMAKEMODELCOLORMODEL_YEARNUMBER_OF_TRIPSMULTIPLE_TNPS
53524720181ILLincolnMkzBlack2017999True
95181620193ILLincolnMkzBlack2016999True
108369620198ILLincolnMkzBlack2018999True
127828020202ILLincolnMkzBlack2017999True
1573809202111ILLincolnMkzBlack2020999True
# YOUR CODE BEGINS

# YOUR CODE ENDS

display(df_black_mkz)
print(
    f"There are {df_black_mkz.shape[0]} black Mkzs that have made 999+ trips in a given month."
)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-11"
_points = 4

tc.assertEqual(df_v.shape, df_v_backup.shape, "df_v should remain unaltered.")

pd.testing.assert_frame_equal(
    df_v_backup.query(
        f'{"mOdEL".upper()} == "{"mKz".capitalize()}" \
                                      & {"cOlOr".upper()} == "{"bLaCK".capitalize()}" \
                                      & {"_".join(["nUMBEr", "OF", "TripS"]).upper()} > (100 * 10 - 2)'
    )
    .sort_values(df_v_backup.columns.tolist())
    .reset_index(drop=True),
    df_black_mkz.sort_values(df_v_backup.columns.tolist()).reset_index(drop=True),
)

🎯 Part 12: Latest Model Year Vehicles

👇 Tasks

  • ✔️ Using df_v, filter rows where the vehicle’s "MODEL_YEAR" (NOT REPORTED_YEAR) is 2025 or newer (>=2025).

  • ✔️ Store the result in a new variable named df_latest_models.

  • ✔️ df_v should remain unaltered after your code.

🔑 Expected Output

Your index column may contain different values.

REPORTED_YEARREPORTED_MONTHSTATEMAKEMODELCOLORMODEL_YEARNUMBER_OF_TRIPSMULTIPLE_TNPS
224873420245ILToyotaCamrySilver2025238False
224985020245ILToyotaCamrySilver2025187False
225298020245ILToyotaCamryWhite2025265False
226093920245ILToyotaCamrySilver2025127False
226234120245ILToyotaCamrySilver2025169False
# YOUR CODE BEGINS

# YOUR CODE ENDS

display(df_latest_models.head(5))
print(f"There are {df_latest_models.shape[0]} latest model registrations.")

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-12"
_points = 3

tc.assertEqual(df_v.shape, df_v_backup.shape, "df_v should remain unaltered.")

pd.testing.assert_frame_equal(
    df_v_backup.query(f'{"_".join(["MoDeL", "YeaR"]).upper()} > (1010 * 2 + 4)')
    .sort_values(df_v_backup.columns.tolist())
    .reset_index(drop=True),
    df_latest_models.sort_values(df_v_backup.columns.tolist()).reset_index(drop=True),
)

🎯 Part 13: Compare Average Number of Trips by MULTIPLE_TNPS

A driver can choose to drive for only one ridesharing service (e.g., only for Uber) or choose to drive for two or more services (e.g., for both Uber and Lyft).

MULTIPLE_TNPS column indicates whether a driver is registered with multiple rideshare services. The MULTIPLE_TNPS value of a registered vehicle will be:

  • False if the driver is only registered for one service (e.g., only for Uber)

  • True if the driver is registered for two or more services (e.g., both Uber and Lyft)

Question: Do drivers make more trips on average if they are registered with multiple rideshare services?

Your goal is to answer that question using data. 📐

👇 Tasks

  • ✔️ Using df_v, calculate the following two metrics:

    1. Average number of trips by vehicles where MULTIPLE_TNPS is False.

      • ✏️ Store this in a new variable named avg_num_trips_single_tnps.

      • avg_num_trips_single_tnps should be a float-typed variable.

    2. Average number of trips by vehicles where MULTIPLE_TNPS is True.

      • ✏️ Store this in a new variable named avg_num_trips_multiple_tnps.

      • avg_num_trips_multiple_tnps should be a float-typed variable.

  • ✔️ df_v should remain unaltered after your code.

🚀 Hints

  • There are many ways to do this. The process described below is only one of them.

  • Create the following two DataFrames using filter.

    1. df_single_tnps: Only contains rows where MULTIPLE_TNPS is False.

    2. df_multiple_tnps: Only contains rows where MULTIPLE_TNPS is True.

  • For each of the two DataFrames you’ve created, calculate the average of NUMBER_OF_TRIPS.

    • Example: df_single_tnps['NUMBER_OF_TRIPS'].mean()

# YOUR CODE BEGINS

# YOUR CODE ENDS

print(
    f"Drivers driving for a single rideshare service made {round(avg_num_trips_single_tnps, 1)} trips (monthly) on average."
)
print(
    f"Drivers driving for multiple rideshare services made {round(avg_num_trips_multiple_tnps, 1)} trips (monthly) on average."
)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-13"
_points = 6

tc.assertEqual(df_v.shape, df_v_backup.shape, "df_v should remain unaltered.")

col_check = "_".join(["MulTiPle", "".join(["T", "n", "P", "s"])]).upper()
col_metric = "_".join(
    ["".join(["NuM", "beR"]), "oF", "".join(["T", "r", "I", "P", "s"])]
).upper()

tc.assertAlmostEqual(
    avg_num_trips_single_tnps, np.mean(df_v.query(f"~{col_check}")[col_metric])
)
tc.assertAlmostEqual(
    avg_num_trips_multiple_tnps, np.mean(df_v.query(f"{col_check}")[col_metric])
)

🔮 Make/Model/Color Analysis

So far, we’ve looked at individual vehicle registrations. In this section, we will aggregate vehicles by one or more criteria.


🎯 Part 14: Create a Summary of Metrics by Vehicle Manufacturer

Our first level of aggregation is the vehicle manufacturer ("MAKE"). For each vehicle manufacturer, you’ll be calculating the total number of registered vehicles ("NUM_REG", the total number of trips ("TOTAL_NUM_TRIPS"), and the average number of trips ("AVG_NUM_TRIPS").

👇 Tasks

  • ✔️ Copy the code from the image above into the code cell below.

    • 😺 Try to understand what the code does.

  • ✔️ df_v should remain unaltered after your code.

  • ✔️ We will provide you with the code for this deliverable.

Groupby make
# YOUR CODE BEGINS

# YOUR CODE ENDS

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-14"
_points = 3

df_check = (
    df_v_backup.groupby("".join(["m", "ak", "e"]).upper())
    .agg({"number_of_trips".upper(): ["".join(["c", "ou", "n", "t"]), "sum", "mean"]})
    .reset_index()
)
df_check.columns = ["AVG_NUM_TRIPS", "TOTAL_NUM_TRIPS", "NUM_REG", "MAKE"][::-1]
df_check = df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True)
df_by_make_backup = df_check.copy()

tc.assertEqual(df_v.shape, df_v_backup.shape, "df_v should remain unaltered.")

pd.testing.assert_frame_equal(
    df_check, df_by_make.sort_values(df_by_make.columns.tolist()).reset_index(drop=True)
)

🎯 Part 15: Sort by Number of Registered Vehicles

Can you sort df_by_make by popularity (i.e., number of registrations)?

👇 Tasks

  • ✔️ Sort df_by_make by the "NUM_REG" column in descending order.

  • ✔️ Store the sorted result in a new variable named df_by_make_popularity.

  • ✔️ df_by_make should remain unaltered after your code (out-of-place sort).

    • In other words, your code should store the sorted result in df_by_make_popularity without affecting df_by_make.

🔑 Expected Output

Your index column may contain different values.

MAKENUM_REGTOTAL_NUM_TRIPSAVG_NUM_TRIPS
49Toyota702188188895048269.0
35Nissan25840860317213233.4
16Honda21414550758111237.0
8Chevrolet17525937702454215.1
18Hyundai16347439390000241.0
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_by_make_popularity.head(5)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-15"
_points = 3

df_check = df_by_make_backup.sort_values(
    "_".join(["num", "reg"]).upper(), ascending=bool(0), inplace=bool(0)
)

pd.testing.assert_frame_equal(
    df_by_make_backup.sort_values(df_by_make_backup.columns.tolist()).reset_index(
        drop=True
    ),
    df_by_make.sort_values(df_by_make.columns.tolist()).reset_index(drop=True),
    "df_by_make should remain unaltered.",
)

pd.testing.assert_series_equal(
    df_check["NUM_REG"].reset_index(drop=True),
    df_by_make_popularity["NUM_REG"].reset_index(drop=True),
)

🎯 Part 16: Sort by Average Number of Trips (Monthly)

👇 Tasks

  • ✔️ Sort df_by_make from 🎯 Part 14 (NOT df_by_make_popularity) by the "AVG_NUM_TRIPS" column in descending order.

  • ✔️ Store the sorted result in a new variable named df_by_make_trips.

  • ✔️ Drop the "NUM_REG" and "TOTAL_NUM_TRIPS" columns from df_by_make_trips.

    • After your code runs, df_by_make_trips should only have two columns - "MAKE" and "AVG_NUM_TRIPS".

  • ✔️ df_by_make should remain unaltered after your code.

    • In other words, your code should (1) store the sorted result in df_by_make_trips and (2) drop "NUM_REG" and "TOTAL_NUM_TRIPS" columns without affecting df_by_make.

  • ⚠️ Warning: Running your code more than once may throw an error because you cannot drop a column that has already been dropped. Despite the error message, this is not a huge problem as it won’t have any effect on your DataFrame. The error message will go away if you go back a few steps and restart with a fresh df_by_make.

🔑 Expected Output

Your index column may contain different values.

MAKEAVG_NUM_TRIPS
6Byd288.1
37Polestar287.5
48Tesla287.5
28Mercedes269.5
49Toyota269.0
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_by_make_trips.head(5)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-16"
_points = 5

df_check = df_by_make_backup.sort_values(
    "_".join(["avg", "num", "trips"]).upper(), ascending=bool(0), inplace=bool(0)
).drop(columns=["NUM_REG", "TOTAL_NUM_TRIPS"])

pd.testing.assert_frame_equal(
    df_check.reset_index(drop=True), df_by_make_trips.reset_index(drop=True)
)

🎯 Part 17: Average Number of Trips of Japanese Big 3

👇 Tasks

  • ✔️ Using df_by_make_trips, retrieve the following three values.

    1. avg_num_trips_toyota: Average number of trips (monthly) of Toyota vehicles

    2. avg_num_trips_nissan: Average number of trips (monthly) of Nissan vehicles

    3. avg_num_trips_honda: Average number of trips (monthly) of Honda vehicles

  • ✔️ All 3 variables must be float-typed.

  • ✔️ df_by_make_trips should remain unaltered after your code.

  • ⚠️ You must perform this task programmatically. You cannot manually enter the numbers (e.g., avg_num_trips_toyota = 269.0 is NOT a correct answer).

🚀 Hints

To retrieve the value of major1 column of Dylan from df_you, you can use the following code:

df_you[df_you['name'] == 'Dylan']['major1'].iloc[0]

The code above will...

  1. Filter row(s) where name is "Dylan".

  2. Retrieve the major1 column as a Pandas Series.

  3. Get the value of the first row in the Series (iloc[0] returns the value of the first row).

# YOUR CODE BEGINS

# YOUR CODE ENDS

print(
    f"Toyota vehicles have an average number of {round(avg_num_trips_toyota, 1)} trips (monthly)."
)
print(
    f"Nissan vehicles have an average number of {round(avg_num_trips_nissan, 1)} trips (monthly)."
)
print(
    f"Honda vehicles have an average number of {round(avg_num_trips_honda, 1)} trips (monthly)."
)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-17"
_points = 6

df_check = (
    df_v_backup.groupby("MAKE")
    .agg(
        {
            "number_of_trips".upper(): [
                "".join(["c", "ou", "n", "t"]),
                "sum",
                "MEAN".lower(),
            ]
        }
    )
    .reset_index()
)
df_check.columns = ["AVG_NUM_TRIPS", "TOTAL_NUM_TRIPS", "NUM_REG", "MAKE"][::-1]

col_check = "_".join(["avg", "num", "trips"]).upper()
make_check = list(map(lambda x: x.capitalize(), ["toYOta", "niSSan", "hONDa"]))
var_check = [avg_num_trips_toyota, avg_num_trips_nissan, avg_num_trips_honda]

for i in range(len(var_check)):
    tc.assertAlmostEqual(
        var_check[i], df_check.query(f'MAKE == "{make_check[i]}"')[col_check].iloc[0]
    )

🎯 Part 18: Create a Summary of Metrics by Vehicle Model

Our second level of aggregation is the vehicle manufacturer ("MAKE") and model ("MODEL"). For each vehicle model, you’ll be calculating the total number of registered vehicles ("NUM_REG", the total number of trips ("TOTAL_NUM_TRIPS"), and the average number of trips ("AVG_NUM_TRIPS").

Again, we will provide you with the code for this deliverable.

Groupby make

👇 Tasks

  • ✔️ Copy the code from the image above into the code cell below.

    • 😺 Try to understand what the code does.

  • ✔️ df_v should remain unaltered after your code.

# YOUR CODE BEGINS

# YOUR CODE ENDS

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-18"
_points = 3

df_check = (
    df_v_backup.groupby(["MODEL", "MAKE"][::-1])
    .agg(
        {
            "number_of_trips".upper(): [
                "".join(["c", "ou", "n", "t"]),
                "SuM".lower(),
                "MeaN".lower(),
            ]
        }
    )
    .reset_index()
)
df_check.columns = ["AVG_NUM_TRIPS", "TOTAL_NUM_TRIPS", "NUM_REG", "MODEL", "MAKE"][
    ::-1
]

tc.assertEqual(df_v.shape, df_v_backup.shape, "df_v should remain unaltered.")

pd.testing.assert_frame_equal(
    df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True),
    df_by_model.sort_values(df_by_model.columns.tolist()).reset_index(drop=True),
)

🎯 Part 19: Rename Columns

👇 Tasks

  • ✔️ Rename the following columns in df_by_model:

    1. "MAKE" to "manufacturer"

    2. "MODEL" to "model"

    3. "NUM_REG" to "registrations"

    4. "AVG_NUM_TRIPS" to "avg_trips"

  • ⚠️ Don’t worry about the "TOTAL_NUM_TRIPS" column. You’ll drop this column soon.

🚀 Hints

To rename a column named "col1" to "col1_renamed" and "col2" to "col2_renamed" in my_dataframe, use the following code.

rename inplace

🔑 Expected Output

Your index column may contain different values.

MAKEMODELNUM_REGTOTAL_NUM_TRIPSAVG_NUM_TRIPS
0AcuraIlx1593339824213.3
1AcuraIntegra7214113196.0
2AcuraMdx173164076282235.4
3AcuraMdx Hybrid13432447242.1
4AcuraRdx2975672284226.0
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_by_model.head(5)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-19"
_points = 3

df_check = (
    df_v_backup.groupby(["MODEL", "MAKE"][::-1])
    .agg(
        {
            "number_of_trips".upper(): [
                "".join(["c", "ou", "n", "t"]),
                "sum",
                "meaN".lower(),
            ]
        }
    )
    .reset_index()
)
df_check.columns = [
    "avg_trips",
    "TOTAL_NUM_TRIPS",
    "registrations",
    "model",
    "manufacturer",
][::-1]

pd.testing.assert_frame_equal(
    df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True),
    df_by_model.sort_values(df_by_model.columns.tolist()).reset_index(drop=True),
)

🎯 Part 20: Drop TOTAL_NUM_TRIPS Column

👇 Tasks

  • ✔️ Drop the "TOTAL_NUM_TRIPS" column from df_by_model.

  • ⚠️ Warning: Running your code more than once may throw an error because you cannot drop a column that has already been dropped. Despite the error message, this is not a huge problem as it won’t have any effect on your DataFrame. The error message will go away if you go back a few steps and restart with a fresh df_by_model.

🔑 Expected Output

Your index column may contain different values.

manufacturermodelregistrationsavg_trips
0AcuraIlx1593213.3
1AcuraIntegra72196.0
2AcuraMdx17316235.4
3AcuraMdx Hybrid134242.1
4AcuraRdx2975226.0
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_by_model.head(5)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-20"
_points = 2

df_check = (
    df_v_backup.groupby(["MODEL", "MAKE"][::-1])
    .agg(
        {
            "number_of_trips".upper(): [
                "".join(["c", "ou", "n", "t"]),
                "".join(["me", "an"]),
            ]
        }
    )
    .reset_index()
)
df_check.columns = list(
    map(
        lambda x: x.lower(),
        ["AVg_TriPS", "ReGisTraTiONS", "MoDeL", "ManUfacTurer"][::-1],
    )
)

pd.testing.assert_frame_equal(
    df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True),
    df_by_model.sort_values(df_by_model.columns.tolist()).reset_index(drop=True),
)

🎯 Part 21: Sort by Number of Registered Vehicles

Can you sort df_by_model by popularity?

👇 Tasks

  • ✔️ Sort df_by_model by the "registrations" column in descending order.

  • ✔️ Update df_by_model without creating a new variable.

    • In other words, you’re performing an in-place sort.

🔑 Expected Output

Your index column may contain different values.

manufacturermodelregistrationsavg_trips
664ToyotaCamry206733266.5
686ToyotaPrius116086286.5
669ToyotaCorolla113473259.3
565NissanAltima82660237.1
580NissanSentra74164241.4
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_by_model.head(5)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-21"
_points = 2

df_check = (
    df_v_backup.groupby(["MODEL", "MAKE"][::-1])
    .agg({"number_of_trips".upper(): ["".join(["c", "o", "u", "n", "t"]), "mean"]})
    .sort_values(("NUMBER_OF_TRIPS", "count"), ascending=False)
    .reset_index()
)
df_check.columns = list(
    map(
        lambda x: x.lower(),
        ["AVg_TriPS", "ReGisTraTiONS", "MoDeL", "ManUfacTurer"][::-1],
    )
)

pd.testing.assert_series_equal(
    df_check["registrations"].reset_index(drop=True),
    df_by_model["registrations"].reset_index(drop=True),
)
pd.testing.assert_frame_equal(
    df_check.sort_values(
        ["registrations", "manufacturer", "model"], ascending=[False, True, True]
    ).reset_index(drop=True),
    df_by_model.sort_values(
        ["registrations", "manufacturer", "model"], ascending=[False, True, True]
    ).reset_index(drop=True),
)

🎯 Part 22: Number of Lincoln LS Registrations

👇 Tasks

  • ✔️ Using df_by_model, retrieve the number of Lincoln LS registrations.

    • manufacturer == "Lincoln"

    • model == "Ls" (case-sensitive)

  • ✔️ Store the result in a new variable named num_lincoln_ls_registrations.

  • ✔️ num_lincoln_ls_registrations should be an int type.

  • ✔️ df_by_model should remain unaltered after your code.

  • ⚠️ Filtering by the model column only (model == 'Ls') may yield incorrect results as Lexus also has a model with the same name (Lexus LS). Use both manufacturer and model columns to find the matching row.

🚀 Hints

To retrieve the "fav_restaurant" column value of a person where name == 'Josh' and city == 'Chicago', use the code below.

df_you[(df_you['name'] == 'Josh') & (df_you['city'] == 'Chicago')]['fav_restaurant'].iloc[0]

The code above will...

  1. Filter row(s) where name is "Josh" and city is "Chicago".

  2. Retrieve the fav_restaurant column as a Pandas Series.

  3. Get the value of the first element from the Series (iloc[0] returns the value of the element at index position 0, which is the first element).

# YOUR CODE BEGINS

# YOUR CODE ENDS

print(f"There are {num_lincoln_ls_registrations} Lincoln Ls registrations.")

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-22"
_points = 2

df_check = (
    df_v_backup.groupby(["MODEL", "MAKE"][::-1])
    .agg({"number_of_trips".upper(): "".join(["c", "o", "u", "n", "t"])})
    .reset_index()
)

tc.assertEqual(
    num_lincoln_ls_registrations,
    df_check.query(
        f'MAKE == "{"lInCoLn".capitalize()}" & MODEL == "{"lS".capitalize()}"'
    )["_".join(["nUmBer", "oF", "tRiPs"]).upper()].iloc[0],
)

🎯 Part 23: Number of Ford 500 Registrations

This part is very similar to the previous one.

👇 Tasks

  • ✔️ Using df_by_model, retrieve the number of Ford 500 registrations.

    • manufacturer == 'Ford'

    • model == '500' (⚠️ string '500', not number 500)

  • ✔️ Store the result in a new variable named num_ford_500_registrations.

  • ✔️ num_ford_500_registrations should be an int type.

  • ✔️ df_by_model should remain unaltered after your code.

  • ⚠️ Filtering by the model column only (model == '500') may yield incorrect results as Fiat also has a model with the same name (Fiat 500). Use both manufacturer and model columns to find the matching row.

# YOUR CODE BEGINS

# YOUR CODE ENDS

print(f"There are {num_ford_500_registrations} Ford 500 registrations.")

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-23"
_points = 2

df_check = (
    df_v_backup.groupby(["MODEL", "MAKE"][::-1])
    .agg({"number_of_trips".upper(): "".join(["c", "o", "u", "n", "t"])})
    .reset_index()
)

tc.assertEqual(
    num_ford_500_registrations,
    df_check.query(f'MAKE == "{"fOrD".capitalize()}" & MODEL == "{5 * 5 * 4 * 5}"')[
        "_".join(["nUmBer", "oF", "tRiPs"]).upper()
    ].iloc[0],
)

🎯 Part 24: Create a Summary of Metrics by Vehicle Model and Color

Our final level of aggregation is the vehicle manufacturer ("MAKE"), model ("MODEL"), and color ("COLOR").

👇 Tasks

  • ✔️ Copy the code from the image above into the code cell below.

    • 😺 Try to understand what the code does.

  • ✔️ df_v should remain unaltered after your code.

  • ✔️ Again, we will provide you with the code for this deliverable.

Groupby make

🔑 Expected Output

Your index column may contain different values.

Note that the COLOR column has not been cleaned and contains double quotes in some cells.

MAKEMODELCOLORREPORTED_MONTH
0AcuraIlx"1
1AcuraIlx“Grey”1
2AcuraIlxBlack749
3AcuraIlxBlue86
4AcuraIlxBrown14
5AcuraIlxBurgundy12
# YOUR CODE BEGINS

# YOUR CODE ENDS

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-24"
_points = 2

df_check = (
    df_v_backup.groupby(["COLOR", "MODEL", "MAKE"][::-1])
    .agg({"nUMbEr_of_trIPs".upper(): "".join(["c", "ou", "n", "t"])})
    .reset_index()
)
df_check.columns = ["REPORTED_MONTH", "COLOR", "MODEL", "MAKE"][::-1]

tc.assertEqual(df_v.shape, df_v_backup.shape, "df_v should remain unaltered.")

pd.testing.assert_frame_equal(
    df_check.sort_values(df_check.columns.tolist()).reset_index(drop=True),
    df_by_color.sort_values(df_by_color.columns.tolist()).reset_index(drop=True),
)

🎯 Part 25: Rename Columns

👇 Tasks

  • ✔️ Rename the following columns in df_by_color in place:

    1. "MAKE" to "manufacturer"

    2. "MODEL" to "model"

    3. "COLOR" to "color"

    4. "REPORTED_MONTH" to "registrations"

🔑 Expected Output

Your index column may contain different values.

manufacturermodelcolorregistrations
0AcuraIlx"1
1AcuraIlx“Grey”1
2AcuraIlxBlack749
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_by_color.head(3)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-25"
_points = 3

df_check = (
    df_v_backup.groupby(["COLOR", "MODEL", "MAKE"][::-1])
    .agg({"number_of_trips".upper(): "".join(["c", "ou", "n", "t"])})
    .reset_index()
)
df_check.columns = list(
    map(lambda x: x.lower(), ["RegIstraTioNs", "CoLoR", "MoDeL", "ManUFacTuReR"][::-1])
)

tc.assertEqual(df_by_color.shape, df_check.shape)
tc.assertEqual(df_by_color.columns.tolist(), df_check.columns.tolist())

🎯 Part 26: Sort by Number of Registered Vehicles

👇 Tasks

  • ✔️ Sort df_by_color by the "registrations" column in descending order.

  • ✔️ Update df_by_color without creating a new variable.

    • In other words, you’re performing an in-place sort.

🔑 Expected Output

Your index column may contain different values.

manufacturermodelcolorregistrations
8272ToyotaCamryBlack65331
8313ToyotaCamrySilver49194
8321ToyotaCamryWhite35804
8447ToyotaCorollaSilver31255
8691ToyotaPriusSilver28866
# YOUR CODE BEGINS

# YOUR CODE ENDS

df_by_color.head(5)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.

  • ❌ If the code cell produces an error, review your code and fix any mistakes.

# DO NOT CHANGE THE CODE IN THIS CELL
_test_case = "part-26"
_points = 3

df_check = (
    df_v_backup.groupby(["COLOR", "MODEL", "MAKE"][::-1])
    .agg({"number_of_trips".upper(): "".join(["c", "ou", "n", "t"])})
    .reset_index()
)
df_check.columns = list(
    map(lambda x: x.lower(), ["RegIstraTioNs", "CoLoR", "MoDeL", "ManUFacTuReR"][::-1])
)

pd.testing.assert_series_equal(
    df_check["registrations"].sort_values(ascending=False).reset_index(drop=True),
    df_by_color["registrations"].reset_index(drop=True),
)
pd.testing.assert_frame_equal(
    df_check.sort_values(["manufacturer", "model", "color"]).reset_index(drop=True),
    df_by_color.sort_values(["manufacturer", "model", "color"]).reset_index(drop=True),
)

🍸 Submitting Your Notebook

There is one final step before exporting the notebook as an .ipynb file for submission. Restart your runtime (kernal) and run all cells from the beginning to ensure that your notebook does not contain any errors.

  • Go to the “Runtime” menu (“Kernel” if you’re on Jupyter Lab) on top.

  • Select “Restart and run all”.

  • Make sure no code cell throws an error.

  • Failing to pass this step may result in a significant loss of points since the autograder will fail to run.

image.png
print("🎯 Restart and run all successful")