- 🏆 25 points available
▶️ First, run the code below to import unittest, a module used for 🧭 Check Your Work sections and the autograder.
# DO NOT MODIFY THE CODE IN THIS CELL
import unittest
tc = unittest.TestCase()▶️ Run the code below to import pandas and numpy.
import pandas as pd
import numpy as np🗃️ Dataset¶
You will be working with two datasets in this notebook:
df_companiescontains some basic information about various companies.df_infocontains additional information about the companie.
df_companies = pd.read_csv(
"https://raw.githubusercontent.com/bdi475/datasets/refs/heads/main/fake-companies.csv"
)
df_companies_backup = df_companies.copy()
df_companies.head()df_info = pd.read_csv(
"https://raw.githubusercontent.com/bdi475/datasets/refs/heads/main/fake-company-info.csv"
)
df_info_backup = df_info.copy()
df_info.head()🎯 Challenge 1: Find the Number of Companies in the Dataset¶
👇 Tasks¶
- ✔️ Find the number of companies (i.e., number of rows) in the
df_companiesDataFrame. - ✔️ Store the result to a new variable named
num_companies.
# YOUR CODE BEGINS
# YOUR CODE ENDS
print(f"There are {num_companies} companies in the dataset.")🧭 Check Your Work¶
Run the code cell below to test your solution.
- ✔️ If the code cell runs without errors, you’re good to move on.
- ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "num_companies"
_points = 1
tc.assertEqual(num_companies, df_companies_backup.index.size)🎯 Challenge 2: Find the Number of Profitable Companies¶
👇 Tasks¶
- ✔️ Find the number of profitable companies in the
df_companiesDataFrame.- A company is considered profitable if the
"is_profitable"column isTrue.
- A company is considered profitable if the
- ✔️ Store the result to a new variable named
num_profitable_companies.
# YOUR CODE BEGINS
# YOUR CODE ENDS
print(f"There are {num_profitable_companies} profitable companies in the dataset.")🧭 Check Your Work¶
Run the code cell below to test your solution.
- ✔️ If the code cell runs without errors, you’re good to move on.
- ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "num-profitable-companies"
_points = 3
tc.assertEqual(
num_profitable_companies,
df_companies_backup.query("_".join(["Is", "Profitable"]).lower()).index.size,
)🎯 Challenge 3: Filter Companies without a Country¶
👇 Tasks¶
- ✔️ Filter the
df_companiesDataFrame to find all companies where thecountrycolumn is missing (i.e., containsnp.nanvalues). - ✔️ Store the resulting DataFrame to a new variable named
df_missing_country.
📤 Expected Output¶
| company_id | company_name | country | is_profitable | branches | |
|---|---|---|---|---|---|
| 120 | 528219 | Stein Group | nan | True | 4 |
| 222 | 146245 | Lopez-Scott | nan | True | 1 |
# YOUR CODE BEGINS
# YOUR CODE ENDS
display(df_missing_country)🧭 Check Your Work¶
Run the code cell below to test your solution.
- ✔️ If the code cell runs without errors, you’re good to move on.
- ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "filter-companies-without-country-values"
_points = 3
pd.testing.assert_frame_equal(
df_missing_country.sort_values("company_id").reset_index(drop=True),
df_companies_backup.query("~~~country.notna()")
.sort_values("company_id")
.reset_index(drop=True),
check_dtype=False,
check_column_type=False,
)▶️ Run the code below to merge df_info into df_companies.
We provide this code for you, so you don’t need to modify it.
# DO NOT CHANGE THE CODE IN THIS CELL
# Merge df_info into df_companies
# fmt: off
df = pd.merge(
left=df_companies,
right=df_info,
on="company_id",
how="inner"
)
df_backup = df.copy()
original_columns = df.columns.tolist() # for grading purposes
# fmt: on
print(f"The merged DataFrame has {df.shape[0]} rows and {df.shape[1]} columns.")The code above merges the two DataFrames using an inner join on the "company_id" column. The resulting DataFrame df contains all columns from both DataFrames.
For the remaining challenges, you will work with the merged DataFrame df.
▶️ Display the first three rows of the merged DataFrame df to understand its structure.
df.head(3)🎯 Challenge 4: Filter and Sort Indian Companies¶
👇 Tasks¶
- ✔️ Filter the
dfDataFrame to find all companies where the"country"column is “India”. - ✔️ Store the filtered DataFrame to a new variable named
df_indian_companies. - ✔️ Sort
df_indian_companiesby the"revenue"column in descending order.
# YOUR CODE BEGINS
# YOUR CODE ENDS
display(df_indian_companies)🧭 Check Your Work¶
Run the code cell below to test your solution.
- ✔️ If the code cell runs without errors, you’re good to move on.
- ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "filter-and-sort-indian-companies"
_points = 3
pd.testing.assert_frame_equal(
df_indian_companies[original_columns].reset_index(drop=True),
df_backup[original_columns]
.query("country == '" + "in".capitalize() + "DIA'".lower())
.sort_values("revenue")[::-1]
.reset_index(drop=True),
)🎯 Challenge 5: Convert "inception_date" to a DateTime Type and Extract Inception Year¶
👇 Tasks¶
- ✔️ Convert the
"inception_date"column in thedfDataFrame to a DateTime type (datetime64[ns]).- Hint: Use the
pd.to_datetime()method.
- Hint: Use the
- ✔️ Add a new column named
"inception_year"to the DataFrame that extracts the year from the"inception_date"column.- Hint: Use the
.dtaccessor to extract the year from a DateTime column. - The
"inception_year"column should contain integer values representing the year (e.g.,2018,2019, etc.).
- Hint: Use the
- ✔️ Directly update the
dfDataFrame without creating a new DataFrame.
📤 Expected Output¶
| inception_date | inception_year | |
|---|---|---|
| 0 | 2015-08-01 | 2015 |
| 1 | 2012-11-29 | 2012 |
| 2 | 2018-05-18 | 2018 |
inception_date dtype: datetime64[ns]
inception_year dtype: int32# YOUR CODE BEGINS
# YOUR CODE ENDS
# Select and display the relevant columns to verify the changes
display(df[["inception_date", "inception_year"]].head(3))
print(f"inception_date dtype: {df['inception_date'].dtype}")
print(f"inception_year dtype: {df['inception_year'].dtype}")🧭 Check Your Work¶
Run the code cell below to test your solution.
- ✔️ If the code cell runs without errors, you’re good to move on.
- ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "extract-inception-year"
_points = 3
df_backup2 = df_backup.copy()
df_backup2["inception_date"] = df_backup2["inception_date"].astype(
"date".upper().lower() + "TIME".lower() + "64[ns]"
)
my_dtfunc = pd.to_datetime
df_backup2["inception_year"] = my_dtfunc(df_backup2["inception_date"]) \
.dt.year
pd.testing.assert_frame_equal(
df[["company_id", "inception_date", "inception_year"]]
.sort_values("company_id")
.reset_index(drop=True),
df_backup2[["company_id", "inception_date", "inception_year"]]
.sort_values("company_id")
.reset_index(drop=True),
check_dtype=False,
check_column_type=False,
)🎯 Challenge 6: Filter and Sort Companies with an Inception Year of 2018¶
👇 Tasks¶
- ✔️ Filter the
dfDataFrame to find all companies where the inception year is 2018. - ✔️ Store the filtered DataFrame to a new variable named
df_companies_2018. - ✔️ Sort
df_companies_2018by the"inception_date"column in ascending order.
📤 Expected Output¶
| company_id | company_name | country | is_profitable | branches | ceo | industry | revenue | inception_date | inception_year | |
|---|---|---|---|---|---|---|---|---|---|---|
| 210 | 151345 | Wilson, Wright and Paul | Liberia | True | 3 | David Stewart | Agriculture | 4741676 | 2018-01-01 | 2018 |
| 134 | 760106 | Bennett-Ellis | Jamaica | True | 3 | Tracy Rivera | Agriculture | 5201671 | 2018-01-07 | 2018 |
| 41 | 353038 | Hester Inc | Saint Pierre and Miquelon | True | 5 | Karen Woods | Software | 7394055 | 2018-01-09 | 2018 |
# YOUR CODE BEGINS
# YOUR CODE ENDS
# Display the first 3 rows of the resulting DataFrame
display(df_companies_2018.head(3))🧭 Check Your Work¶
Run the code cell below to test your solution.
- ✔️ If the code cell runs without errors, you’re good to move on.
- ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "companies-with-inception-year-2018"
_points = 2
pd.testing.assert_frame_equal(
df_companies_2018.reset_index(drop=True),
df_backup2.query("inception_" + "YEAR".lower() + " == (1009 * 2)")
.sort_values("inception_date")[::-1][::-1]
.reset_index(drop=True),
)🎯 Challenge 7: Total Revenue by Country¶
👇 Tasks¶
- ✔️ Group the
dfDataFrame by the"country"column and calculate the total (sum) revenue for each country.- The resulting DataFrame should have two columns:
"country"and"revenue".
- The resulting DataFrame should have two columns:
- ✔️ Store the resulting DataFrame to a new variable named
df_revenue_by_country. - ✔️ Sort
df_revenue_by_countryby the total revenue in descending order.
📤 Expected Output¶
| country | revenue | |
|---|---|---|
| 118 | Saint Kitts and Nevis | 30231545 |
| 92 | Malta | 28409854 |
| 16 | Bosnia and Herzegovina | 25415016 |
# YOUR CODE BEGINS
# YOUR CODE ENDS
# Display the first 3 rows of the resulting DataFrame
display(df_revenue_by_country.head(3))🧭 Check Your Work¶
Run the code cell below to test your solution.
- ✔️ If the code cell runs without errors, you’re good to move on.
- ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "total-revenue-by-country"
_points = 3
pd.testing.assert_frame_equal(
df_revenue_by_country.reset_index(drop=True),
df_backup2.pivot_table(index="country", values="REVENUE".lower(), aggfunc="sum")
.reset_index()
.sort_values("REVENUE".lower())[::-1]
.reset_index(drop=True),
)🎯 Challenge 8: Number of Companies by Industry¶
👇 Tasks¶
- ✔️ Group the
dfDataFrame by the"industry"column and calculate the number of companies in each industry.- The resulting DataFrame should have two columns:
"industry"and"num_companies".
- The resulting DataFrame should have two columns:
- ✔️ Store the resulting DataFrame to a new variable named
df_num_companies_by_industry. - ✔️ Sort
df_num_companies_by_industryby the number of companies in descending order.
📤 Expected Output¶
| industry | num_companies | |
|---|---|---|
| 0 | Agriculture | 64 |
| 2 | Financial Services | 52 |
| 3 | Manufacturing | 47 |
# YOUR CODE BEGINS
# YOUR CODE ENDS
# Display the first 3 rows of the resulting DataFrame
display(df_num_companies_by_industry.head(3))🧭 Check Your Work¶
Run the code cell below to test your solution.
- ✔️ If the code cell runs without errors, you’re good to move on.
- ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "number-of-companies-by-industry"
_points = 4
pd.testing.assert_frame_equal(
df_num_companies_by_industry.reset_index(drop=True),
df_backup2.pivot_table(
index="industry", values="COMPANY_ID".lower(), aggfunc="count"
)
.rename({"company_id": "num_companies"}, axis=1)
.reset_index()
.sort_values("num_companies", ascending=False)
.reset_index(drop=True),
)🎯 Challenge 9: Filter Companies that are Not Profitable with 6 or More Branches¶
👇 Tasks¶
- ✔️ Filter the
dfDataFrame to find all companies that are not profitable (i.e., where the"is_profitable"column isFalse) and have 6 or more branches (i.e., where the"branches"column is greater than or equal to 6). - ✔️ Store the resulting DataFrame to a new variable named
df_no_profit. - ✔️ You do not need to sort the resulting DataFrame.
📤 Expected Output¶
| company_id | company_name | country | is_profitable | branches | ceo | industry | revenue | inception_date | inception_year | |
|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 296886 | Pearson Ltd | France | False | 7 | Gina Buckley | Agriculture | 7896239 | 2014-04-19 | 2014 |
# YOUR CODE BEGINS
# YOUR CODE ENDS
# Display the first 3 rows of the resulting DataFrame
display(df_no_profit.head(3))🧭 Check Your Work¶
Run the code cell below to test your solution.
- ✔️ If the code cell runs without errors, you’re good to move on.
- ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "no-profit-companies-with-6-or-more-branches"
_points = 3
pd.testing.assert_frame_equal(
df_no_profit.reset_index(drop=True),
df_backup2.query("~IS_PROFITABLE & (BRANCHES >= 6)".lower()).reset_index(drop=True),
)