Skip to article frontmatterSkip to article content

Quiz 3 Prep - Venture Capital Exit Analysis

  • 🏆 25 points available

▶️ First, run the code below to import unittest, a module used for 🧭 Check Your Work sections and the autograder.

# DO NOT MODIFY THE CODE IN THIS CELL
import unittest

tc = unittest.TestCase()

▶️ Run the code below to import pandas and numpy.

import pandas as pd
import numpy as np

🗃️ Dataset

You will be working with two datasets in this notebook:

  • df_companies contains some basic information about various companies.
  • df_info contains additional information about the companie.
df_companies = pd.read_csv(
    "https://raw.githubusercontent.com/bdi475/datasets/refs/heads/main/fake-companies.csv"
)
df_companies_backup = df_companies.copy()
df_companies.head()
df_info = pd.read_csv(
    "https://raw.githubusercontent.com/bdi475/datasets/refs/heads/main/fake-company-info.csv"
)
df_info_backup = df_info.copy()
df_info.head()

🎯 Challenge 1: Find the Number of Companies in the Dataset

👇 Tasks

  • ✔️ Find the number of companies (i.e., number of rows) in the df_companies DataFrame.
  • ✔️ Store the result to a new variable named num_companies.
# YOUR CODE BEGINS

# YOUR CODE ENDS

print(f"There are {num_companies} companies in the dataset.")

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.
  • ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "num_companies"
_points = 1

tc.assertEqual(num_companies, df_companies_backup.index.size)

🎯 Challenge 2: Find the Number of Profitable Companies

👇 Tasks

  • ✔️ Find the number of profitable companies in the df_companies DataFrame.
    • A company is considered profitable if the "is_profitable" column is True.
  • ✔️ Store the result to a new variable named num_profitable_companies.
# YOUR CODE BEGINS

# YOUR CODE ENDS

print(f"There are {num_profitable_companies} profitable companies in the dataset.")

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.
  • ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "num-profitable-companies"
_points = 3

tc.assertEqual(
    num_profitable_companies,
    df_companies_backup.query("_".join(["Is", "Profitable"]).lower()).index.size,
)

🎯 Challenge 3: Filter Companies without a Country

👇 Tasks

  • ✔️ Filter the df_companies DataFrame to find all companies where the country column is missing (i.e., contains np.nan values).
  • ✔️ Store the resulting DataFrame to a new variable named df_missing_country.

📤 Expected Output

company_idcompany_namecountryis_profitablebranches
120528219Stein GroupnanTrue4
222146245Lopez-ScottnanTrue1
# YOUR CODE BEGINS

# YOUR CODE ENDS

display(df_missing_country)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.
  • ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "filter-companies-without-country-values"
_points = 3

pd.testing.assert_frame_equal(
    df_missing_country.sort_values("company_id").reset_index(drop=True),
    df_companies_backup.query("~~~country.notna()")
    .sort_values("company_id")
    .reset_index(drop=True),
    check_dtype=False,
    check_column_type=False,
)

▶️ Run the code below to merge df_info into df_companies.

We provide this code for you, so you don’t need to modify it.

# DO NOT CHANGE THE CODE IN THIS CELL
# Merge df_info into df_companies
# fmt: off
df = pd.merge(
    left=df_companies, 
    right=df_info, 
    on="company_id", 
    how="inner"
)
df_backup = df.copy()
original_columns = df.columns.tolist() # for grading purposes
# fmt: on

print(f"The merged DataFrame has {df.shape[0]} rows and {df.shape[1]} columns.")

The code above merges the two DataFrames using an inner join on the "company_id" column. The resulting DataFrame df contains all columns from both DataFrames.

For the remaining challenges, you will work with the merged DataFrame df.

▶️ Display the first three rows of the merged DataFrame df to understand its structure.

df.head(3)

🎯 Challenge 4: Filter and Sort Indian Companies

👇 Tasks

  • ✔️ Filter the df DataFrame to find all companies where the "country" column is “India”.
  • ✔️ Store the filtered DataFrame to a new variable named df_indian_companies.
  • ✔️ Sort df_indian_companies by the "revenue" column in descending order.
# YOUR CODE BEGINS

# YOUR CODE ENDS

display(df_indian_companies)

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.
  • ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "filter-and-sort-indian-companies"
_points = 3

pd.testing.assert_frame_equal(
    df_indian_companies[original_columns].reset_index(drop=True),
    df_backup[original_columns]
    .query("country == '" + "in".capitalize() + "DIA'".lower())
    .sort_values("revenue")[::-1]
    .reset_index(drop=True),
)

🎯 Challenge 5: Convert "inception_date" to a DateTime Type and Extract Inception Year

👇 Tasks

  • ✔️ Convert the "inception_date" column in the df DataFrame to a DateTime type (datetime64[ns]).
    • Hint: Use the pd.to_datetime() method.
  • ✔️ Add a new column named "inception_year" to the DataFrame that extracts the year from the "inception_date" column.
    • Hint: Use the .dt accessor to extract the year from a DateTime column.
    • The "inception_year" column should contain integer values representing the year (e.g., 2018, 2019, etc.).
  • ✔️ Directly update the df DataFrame without creating a new DataFrame.

📤 Expected Output

inception_dateinception_year
02015-08-012015
12012-11-292012
22018-05-182018
inception_date dtype: datetime64[ns]
inception_year dtype: int32
# YOUR CODE BEGINS

# YOUR CODE ENDS

# Select and display the relevant columns to verify the changes
display(df[["inception_date", "inception_year"]].head(3))
print(f"inception_date dtype: {df['inception_date'].dtype}")
print(f"inception_year dtype: {df['inception_year'].dtype}")

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.
  • ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "extract-inception-year"
_points = 3

df_backup2 = df_backup.copy()
df_backup2["inception_date"] = df_backup2["inception_date"].astype(
    "date".upper().lower() + "TIME".lower() + "64[ns]"
)
my_dtfunc = pd.to_datetime
df_backup2["inception_year"] = my_dtfunc(df_backup2["inception_date"]) \
    .dt.year

pd.testing.assert_frame_equal(
    df[["company_id", "inception_date", "inception_year"]]
    .sort_values("company_id")
    .reset_index(drop=True),
    df_backup2[["company_id", "inception_date", "inception_year"]]
    .sort_values("company_id")
    .reset_index(drop=True),
    check_dtype=False,
    check_column_type=False,
)

🎯 Challenge 6: Filter and Sort Companies with an Inception Year of 2018

👇 Tasks

  • ✔️ Filter the df DataFrame to find all companies where the inception year is 2018.
  • ✔️ Store the filtered DataFrame to a new variable named df_companies_2018.
  • ✔️ Sort df_companies_2018 by the "inception_date" column in ascending order.

📤 Expected Output

company_idcompany_namecountryis_profitablebranchesceoindustryrevenueinception_dateinception_year
210151345Wilson, Wright and PaulLiberiaTrue3David StewartAgriculture47416762018-01-012018
134760106Bennett-EllisJamaicaTrue3Tracy RiveraAgriculture52016712018-01-072018
41353038Hester IncSaint Pierre and MiquelonTrue5Karen WoodsSoftware73940552018-01-092018
# YOUR CODE BEGINS

# YOUR CODE ENDS

# Display the first 3 rows of the resulting DataFrame
display(df_companies_2018.head(3))

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.
  • ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "companies-with-inception-year-2018"
_points = 2

pd.testing.assert_frame_equal(
    df_companies_2018.reset_index(drop=True),
    df_backup2.query("inception_" + "YEAR".lower() + " == (1009 * 2)")
    .sort_values("inception_date")[::-1][::-1]
    .reset_index(drop=True),
)

🎯 Challenge 7: Total Revenue by Country

👇 Tasks

  • ✔️ Group the df DataFrame by the "country" column and calculate the total (sum) revenue for each country.
    • The resulting DataFrame should have two columns: "country" and "revenue".
  • ✔️ Store the resulting DataFrame to a new variable named df_revenue_by_country.
  • ✔️ Sort df_revenue_by_country by the total revenue in descending order.

📤 Expected Output

countryrevenue
118Saint Kitts and Nevis30231545
92Malta28409854
16Bosnia and Herzegovina25415016
# YOUR CODE BEGINS

# YOUR CODE ENDS

# Display the first 3 rows of the resulting DataFrame
display(df_revenue_by_country.head(3))

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.
  • ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "total-revenue-by-country"
_points = 3

pd.testing.assert_frame_equal(
    df_revenue_by_country.reset_index(drop=True),
    df_backup2.pivot_table(index="country", values="REVENUE".lower(), aggfunc="sum")
    .reset_index()
    .sort_values("REVENUE".lower())[::-1]
    .reset_index(drop=True),
)

🎯 Challenge 8: Number of Companies by Industry

👇 Tasks

  • ✔️ Group the df DataFrame by the "industry" column and calculate the number of companies in each industry.
    • The resulting DataFrame should have two columns: "industry" and "num_companies".
  • ✔️ Store the resulting DataFrame to a new variable named df_num_companies_by_industry.
  • ✔️ Sort df_num_companies_by_industry by the number of companies in descending order.

📤 Expected Output

industrynum_companies
0Agriculture64
2Financial Services52
3Manufacturing47
# YOUR CODE BEGINS

# YOUR CODE ENDS

# Display the first 3 rows of the resulting DataFrame
display(df_num_companies_by_industry.head(3))

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.
  • ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "number-of-companies-by-industry"
_points = 4

pd.testing.assert_frame_equal(
    df_num_companies_by_industry.reset_index(drop=True),
    df_backup2.pivot_table(
        index="industry", values="COMPANY_ID".lower(), aggfunc="count"
    )
    .rename({"company_id": "num_companies"}, axis=1)
    .reset_index()
    .sort_values("num_companies", ascending=False)
    .reset_index(drop=True),
)

🎯 Challenge 9: Filter Companies that are Not Profitable with 6 or More Branches

👇 Tasks

  • ✔️ Filter the df DataFrame to find all companies that are not profitable (i.e., where the "is_profitable" column is False) and have 6 or more branches (i.e., where the "branches" column is greater than or equal to 6).
  • ✔️ Store the resulting DataFrame to a new variable named df_no_profit.
  • ✔️ You do not need to sort the resulting DataFrame.

📤 Expected Output

company_idcompany_namecountryis_profitablebranchesceoindustryrevenueinception_dateinception_year
9296886Pearson LtdFranceFalse7Gina BuckleyAgriculture78962392014-04-192014
# YOUR CODE BEGINS

# YOUR CODE ENDS

# Display the first 3 rows of the resulting DataFrame
display(df_no_profit.head(3))

🧭 Check Your Work

Run the code cell below to test your solution.

  • ✔️ If the code cell runs without errors, you’re good to move on.
  • ❌ If the code cell produces an error, review your code and fix any mistakes.
_test_case = "no-profit-companies-with-6-or-more-branches"
_points = 3

pd.testing.assert_frame_equal(
    df_no_profit.reset_index(drop=True),
    df_backup2.query("~IS_PROFITABLE & (BRANCHES >= 6)".lower()).reset_index(drop=True),
)