More Plotly Visualizations - BDI 475 Textbook

▶️ Import the following Python packages.

pandas: Use alias pd.
numpy: Use alias np.
plotly.express: Use alias px.
plotly.graph_objects: Use alias go.

import pandas as pd
import numpy as np

import plotly.graph_objects as go
import plotly.express as px

📌 Import dataset¶

Today, we work with a bikesharing trips dataset 🚲 to uncover insights about trips made by subscribers and casual riders of Bluebikes (in Boston). The original dataset was downloaded from https://www.bluebikes.com/system-data and preprocessed for this exercise.

▶️ Import the dataset. This dataset is fairly large (~2 million rows), so loading it may take up to a few minutes.

# Display all columns
pd.set_option("display.max_columns", 50)

df_trips = pd.read_csv(
    "https://github.com/bdi475/datasets/blob/main/bluebikes-trip-data-2020-sampled.csv.gz?raw=true",
    compression="gzip",
    parse_dates=["start_time", "stop_time"],
)

df_trips_backup = df_trips.copy()

display(df_trips)

📦 Box plots and histograms review¶

▶️ Create a box plot of the trip duration for trips less than 30 minutes.

fig = px.box(
    df_trips[df_trips["trip_duration"] < 1800],
    x="trip_duration",
    title="Trip Duration in Seconds (for trips shorter than 30 minutes)",
)
fig.show()
# YOUR CODE ENDS
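
As a reminder of what the box summarizes, the sketch below computes the quartiles, the interquartile range (IQR), and the conventional 1.5 × IQR fences with pandas. The durations are made-up values, not taken from the Bluebikes data, and the numbers may differ slightly from what Plotly draws because quartile methods vary between libraries.

```python
import pandas as pd

# Hypothetical trip durations in seconds (illustration only)
durations = pd.Series([300, 420, 480, 600, 660, 720, 900, 1200, 1500, 1740])

# The box spans Q1 to Q3, with a line at the median
q1 = durations.quantile(0.25)
median = durations.quantile(0.50)
q3 = durations.quantile(0.75)
iqr = q3 - q1

# Points beyond these fences are commonly drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence})")
```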

▶️ Create a box plot of the trip duration by user type.

fig = px.box(
    df_trips,
    x="trip_duration",
    y="user_type",
    title="Trip Duration in Seconds by User Type",
)
fig.show()
# YOUR CODE ENDS

▶️ Create a histogram of the trip duration with 36 bins.

fig = px.histogram(
    df_trips, x="trip_duration", title="Trip Duration Distribution", nbins=36
)
fig.show()
# YOUR CODE ENDS

📈 Line chart¶

A line chart is used to visualize data points connected by lines. It is particularly useful for showing trends over time or continuous data.

▶️ Import a gold price dataset for a simple demo.

df_gold = pd.read_csv(
    "https://github.com/bdi475/datasets/raw/main/gold-annual-closing-price.csv"
)
df_gold

▶️ Create the line chart.

fig = px.line(
    df_gold, x="Year", y="Closing Price", title="Annual Closing Price of Gold"
)
fig.show()

▶️ Create an aggregated DataFrame with number of trips by date.

df_num_trips_by_date = df_trips.groupby(
    df_trips["start_time"].dt.date, as_index=False
).size()

df_num_trips_by_date.rename(
    columns={"start_time": "date", "size": "num_trips"}, inplace=True
)

display(df_num_trips_by_date)

▶️ Create a line chart that displays the number of trips by date.

fig = px.line(
    df_num_trips_by_date,
    x="date",
    y="num_trips",
    title="Number of Trips by Date in 2020",
)
fig.show()
# YOUR CODE ENDS

🟢 Scatter plot¶

A scatter plot is used to visualize individual data points on a two-dimensional plane. It is particularly useful for showing the relationship between two variables.

▶️ Create a scatter plot that displays the number of trips by date.

fig = px.scatter(
    df_num_trips_by_date,
    x="date",
    y="num_trips",
    title="Number of Trips by Date in 2020",
)
fig.show()
# YOUR CODE ENDS

▶️ Create an aggregated DataFrame with number of trips by date & user type.

df_num_trips_by_date_and_user_type = df_trips.groupby(
    [df_trips["start_time"].dt.date, "user_type"], as_index=False
).size()

df_num_trips_by_date_and_user_type.rename(
    columns={"start_time": "date", "size": "num_trips"}, inplace=True
)

display(df_num_trips_by_date_and_user_type)
# YOUR CODE ENDS

▶️ Create a line chart that displays the number of trips by date and user type.

fig = px.line(
    df_num_trips_by_date_and_user_type,
    x="date",
    y="num_trips",
    color="user_type",
    title="Number of Trips by Date and User Type in 2020",
)
fig.show()
# YOUR CODE ENDS

▶️ Create a scatter plot that displays the number of trips by date and user type.

fig = px.scatter(
    df_num_trips_by_date_and_user_type,
    x="date",
    y="num_trips",
    color="user_type",
    title="Number of Trips by Date and User Type in 2020",
)
fig.show()
# YOUR CODE ENDS

📊 Bar chart¶

A bar chart is used to represent categorical data with rectangular bars. The length of each bar is proportional to the value it represents. Bar charts are useful for comparing different categories or groups.

There are two types of bar charts: vertical and horizontal. In vertical bar charts, the bars extend vertically from the x-axis, while in horizontal bar charts, the bars extend horizontally from the y-axis. Vertical bar charts are sometimes called column charts.

▶️ Create an aggregated DataFrame with number of trips by month.

df_num_trips_by_month = df_trips.groupby(
    df_trips["start_time"].dt.month, as_index=False
).size()

df_num_trips_by_month.rename(
    columns={"start_time": "month", "size": "num_trips"}, inplace=True
)

display(df_num_trips_by_month)
# YOUR CODE ENDS

▶️ Create a bar chart that displays the number of trips by month.

fig = px.bar(
    df_num_trips_by_month,
    x="month",
    y="num_trips",
    title="Number of Trips by Month in 2020",
)
fig.show()
# YOUR CODE ENDS

🌐 Exploring more Plotly visualizations¶

▶️ Import Chicago Airbnb listings dataset.

We will use this dataset to create other types of charts.

df_listings = pd.read_csv(
    "https://github.com/bdi475/datasets/raw/main/case-studies/airbnb-sql/Chicago.csv"
)
df_listings_backup = df_listings.copy()
df_listings.head(3)

▶️ Sample 100 listings with price under $200.

df_under_200_sample = df_listings[df_listings["price"] < 200].sample(100)
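
Because `sample()` draws rows at random, the sampled listings (and every chart built from them) change on each run. If reproducible output is preferred, passing a seed through `random_state` is one option. The sketch below uses a small made-up DataFrame, and the seed value 42 is arbitrary.

```python
import pandas as pd

# Hypothetical listings standing in for df_listings
df_demo = pd.DataFrame(
    {"price": [50, 80, 120, 150, 180, 220, 300, 90, 60, 199]}
)

under_200 = df_demo[df_demo["price"] < 200]

# The same seed returns the same rows on every run
sample_a = under_200.sample(5, random_state=42)
sample_b = under_200.sample(5, random_state=42)

print(sample_a.equals(sample_b))
```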

🔮 3D scatter plot¶

A 3D scatter plot is used to visualize data points in three-dimensional space. It allows you to see the relationships between three variables simultaneously. Each point in the plot represents a data point with three coordinates (x, y, z).

A 3D visualization is much easier to read when you can rotate it and explore the data from different angles. Plotly provides this interactivity out of the box.

▶️ Create a 3D scatter plot with the following axes:

x: Number of bedrooms
y: Number of bathrooms
z: Price

fig = px.scatter_3d(
    df_under_200_sample,
    title="Bedrooms, Bathrooms, Price 3D Scatter Plot",
    x="bedrooms",
    y="bathrooms",
    z="price",
    color="room_type",
    template="plotly_dark",
    width=800,
    height=600,
)
fig.show()
# YOUR CODE ENDS

▶️ Find the top 20 neighbourhoods by number of listings.

top_20_neighbourhoods = (
    df_listings["neighbourhood"].value_counts().head(20).index.tolist()
)

top_20_neighbourhoods

['West Town',
 'Lake View',
 'Logan Square',
 'Near North Side',
 'Lincoln Park',
 'Near West Side',
 'Lower West Side',
 'Edgewater',
 'Uptown',
 'North Center',
 'Irving Park',
 'Loop',
 'Avondale',
 'Rogers Park',
 'Near South Side',
 'Bridgeport',
 'Lincoln Square',
 'Grand Boulevard',
 'Hyde Park',
 'Armour Square']

▶️ Filter listings in the top 20 neighbourhoods with a price under $300.

df_filtered = df_listings[
    (df_listings["neighbourhood"].isin(top_20_neighbourhoods))
    & (df_listings["price"] < 300)
]

🥧 Pie chart¶

A pie chart is a circular statistical graphic that is divided into slices to illustrate numerical proportions. Each slice represents a category’s contribution to the whole, making it easy to see relative sizes at a glance. Pie charts are best used when you want to show parts of a whole and compare proportions among categories.

▶️ Create a pie chart that shows the distribution of listings across the top 20 neighbourhoods.

fig = px.pie(
    df_filtered,
    names="neighbourhood",
    title="Neighbourhood breakdown",
    width=800,
    height=700,
)

fig.show()
# YOUR CODE ENDS
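
When only `names` is given, each row counts once, so slice sizes reflect how many rows fall into each category. The proportions behind the slices can be checked with pandas, as in this minimal sketch on a made-up column (the values are hypothetical, not the Chicago data).

```python
import pandas as pd

# Hypothetical neighbourhood column (illustration only)
neighbourhoods = pd.Series(
    ["West Town", "West Town", "Lake View", "Loop", "West Town", "Lake View"],
    name="neighbourhood",
)

# Row counts per category: the basis for each slice's size
counts = neighbourhoods.value_counts()

# Share of the whole in percent: comparable to each slice's percentage label
shares = (neighbourhoods.value_counts(normalize=True) * 100).round(1)

print(counts)
print(shares)
```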

▶️ Find the aggregated statistics by neighbourhood and room type.

df_by_neighbourhood_room_type = (
    df_filtered.groupby(["neighbourhood", "room_type"], as_index=False)
    .agg(
        {
            "name": "count",
            "bedrooms": "mean",
            "bathrooms": "mean",
            "accommodates": "mean",
            "price": "mean",
        }
    )
    .rename(columns={"name": "num_listings"})
)

display(df_by_neighbourhood_room_type.head(5))

🌳 Treemap chart¶

A treemap is a visualization that displays hierarchical data using nested rectangles. Each rectangle represents a category or subcategory, and its size is proportional to a specific value, such as count or sum. Treemaps are useful for visualizing large datasets with multiple levels of hierarchy, allowing you to see the relative sizes of different categories at a glance.

▶️ Create a treemap chart that shows the distribution of listings across the top 20 neighbourhoods.

fig = px.treemap(
    df_by_neighbourhood_room_type,
    path=["neighbourhood"],
    title="Top 20 neighbourhoods breakdown",
    values="num_listings",
    height=700,
)

fig.show()
# YOUR CODE ENDS

A treemap can also represent multiple levels of hierarchy. For example, we can visualize both neighbourhoods and room types within each neighbourhood.

▶️ Create a treemap chart that shows the distribution of listings across the top 20 neighbourhoods, broken down by room type.

fig = px.treemap(
    df_by_neighbourhood_room_type,
    path=["neighbourhood", "room_type"],
    title="Top 20 neighbourhoods breakdown",
    values="num_listings",
    height=700,
)

fig.show()
# YOUR CODE ENDS

🌞 Sunburst chart¶

A sunburst chart is a radial visualization that displays hierarchical data using concentric circles. Each level of the hierarchy is represented by a ring, with the innermost circle representing the root node and outer rings representing child nodes. Sunburst charts are useful for visualizing hierarchical relationships and proportions within a dataset.

Think of a sunburst chart as a circular version of a treemap, where the size of each segment corresponds to a specific value, such as count or sum. Or you can think of it as a pie chart with multiple levels, where each level represents a different layer of the hierarchy.

▶️ Create a sunburst chart that shows the distribution of listings across the top 20 neighbourhoods, broken down by room type.

fig = px.sunburst(
    df_by_neighbourhood_room_type,
    path=["neighbourhood", "room_type"],
    title="Listings Breakdown by Neighbourhood and Room Type",
    values="num_listings",
    width=800,
    height=800,
)

fig.show()
# YOUR CODE ENDS

🔙 Back to the basics¶

While these advanced visualizations can be powerful, it’s essential to remember the importance of clarity and simplicity in data visualization. Always consider your audience and the message you want to convey when choosing the type of chart to use. Sometimes, sticking to basic charts like bar charts or line charts can be more effective in communicating your insights clearly.

▶️ Create a bar chart that shows the distribution of listings across the top 20 neighbourhoods, broken down by room type.

fig = px.bar(
    df_by_neighbourhood_room_type,
    x="num_listings",
    y="neighbourhood",
    color="room_type",
    template="plotly_dark",
    title="Listings Breakdown by Neighbourhood and Room Type",
    height=600,
)

fig.update_yaxes(categoryorder="total ascending")

fig.show()
# YOUR CODE ENDS

🟩 Best Practices¶

Pie Chart Best Practices¶

Use pie charts only to show parts of a meaningful whole (100%); avoid when values are not comparable.
Limit slices (preferably ≤ 6); group small categories into “Other”.
Order slices by size (clockwise) to make comparison easier.
Show both percentage and absolute values (labels or hover) for clarity.
Prefer a donut chart when you need a center label or extra context.
Avoid 3D effects, shadows, or exploded views that distort perception.
Use a clear, colorblind-friendly palette and maintain high contrast between slices.
Don’t use multiple pie charts to compare categories over time - use bar/column or stacked charts instead.
Add a legend or direct labels; avoid overcrowding labels - use hover tooltips for detail.

Treemap Chart Best Practices¶

Use treemaps to show hierarchical composition and relative sizes simultaneously.
Limit the number of top-level categories; collapse/aggregate very small leaves into “Other”.
Size rectangles by a single primary metric and use color to encode a secondary metric (with a clear color scale legend).
Sort children by size to make layout more interpretable.
Label rectangles with category name + value (or percentage); use tooltips for longer text and extra metrics.
Set a minimum rectangle size threshold so very small items don't become unreadable; consider filtering them out.
Maintain consistent color mapping across related visuals for comparability.
Prefer treemaps for overviews; use zoom/interaction to explore deeper levels rather than showing too many nested levels at once.
Provide a caption or legend explaining how size and color map to metrics.

Sunburst Chart Best Practices¶

Use sunbursts for hierarchical data when the radial layout helps show levels and proportions.
Limit depth to 2–3 levels for readability; deeper hierarchies become hard to interpret.
Size segments by a primary value and use color to show a categorical or quantitative secondary metric.
Order segments meaningfully (e.g., by size or logical grouping) to improve pattern recognition.
Display percentages relative to parent and/or total (in tooltips) so viewers understand proportions.
Avoid many thin outer segments - aggregate or filter very small categories.
Provide a clear center label (root) and a legend or annotation explaining color encoding.
Use interactive hover/zoom to reveal details; for static outputs prefer treemap or stacked bars if clarity suffers.
Choose accessible color palettes and ensure good contrast between adjacent segments.
