
Introduction to Data Visualization

Welcome to the final and the coolest part of this course! 😎 I’m not saying that the previous topics aren’t cool. But visualizations are the bridges that can connect your data analytics skills with the outside world. The last thing you want to do is print out a large DataFrame to report your findings.

Data visualization translates raw data into visual stories that reveal patterns, trends, and relationships. Humans process visual information far faster than text or tables—so effective visualizations help analysts think, communicate, and persuade.

There is no such thing as information overload. There is only bad design.

Edward Tufte


✨ From Data to Visualization: A Workflow

  1. Define the Question: What decision or hypothesis are you exploring?
  2. Acquire and Clean Data: Remove noise, handle missing values.
  3. Choose the Visual Form: Match the data type (categorical, continuous, time-series) to an appropriate chart.
  4. Design and Annotate: Add titles, captions, and labels.
  5. Iterate: Test different layouts and get user feedback.

📊 Types of Data Visualization

Exploratory vs. Explanatory

When we visualize data, we are not simply drawing charts - we are searching for patterns. Most quantitative insights can be grouped into five broad categories of data patterns. Recognizing these patterns helps analysts choose the most effective visual representation for their message.

Data patterns

1️⃣ Change

  • Definition: Shows how a variable evolves over time.
  • Examples: Line charts, area charts, or bar charts that track trends.
  • Use When: You want to highlight growth, decline, or seasonal fluctuations.

2️⃣ Clustering

  • Definition: Reveals natural groupings or segments within data.
  • Examples: Scatter plots or bubble charts showing customer segments, product groups, or behavioral clusters.
  • Use When: You want to explore differences and similarities between observations.

3️⃣ Relativity

  • Definition: Displays how parts relate to a whole.
  • Examples: Pie charts, donut charts, or stacked bar charts.
  • Use When: You want to emphasize proportions or contribution to a total.

4️⃣ Ranking

  • Definition: Compares ordered categories to identify leaders or laggards.
  • Examples: Horizontal bar charts, lollipop charts, or sorted column charts.
  • Use When: You want to show the top or bottom performers (e.g., top 10 sales regions).

5️⃣ Correlation

  • Definition: Illustrates relationships between two or more quantitative variables.
  • Examples: Scatter plots, correlation matrices, or regression lines.
  • Use When: You want to determine whether changes in one variable are associated with changes in another.

🧪 Why It Matters

Identifying these five data patterns helps analysts choose the right visualization for their story. Rather than focusing on chart types first, start by asking:

“What kind of pattern am I trying to show - change, clustering, relativity, ranking, or correlation?”

🗂️ Common Chart Types

| Chart Type | Best For | Avoid When |
| --- | --- | --- |
| Bar Chart | Comparing categorical values | Too many categories |
| Line Chart | Showing trends over time | Non-sequential categories |
| Scatter Plot | Revealing relationships between two variables | Too few data points |
| Histogram | Showing data distribution | Comparing multiple groups |
| Box Plot | Summarizing distribution & outliers | Small samples |
| Pie / Donut Chart | Showing parts of a whole | Many small slices |
| Heatmap | Displaying matrix or correlation patterns | Hard-to-read color scales |
| TreeMap / Sunburst | Hierarchical proportions | Need precise comparisons |

🧾 Common Pitfalls in Data Visualization

| Pitfall | Description | Better Practice |
| --- | --- | --- |
| Overuse of 3D | Distorts proportions | Stick to 2D |
| Too Many Colors | Confuses audience | Use ≤ 5 meaningful colors |
| Truncated Y-Axis | Misleads differences | Start at 0 for bar charts |
| Dense Dashboards | Cognitive overload | Prioritize key visuals |
| Unlabeled Axes | Ambiguous meaning | Always label variables |

📚 Dataviz libraries

The most commonly used library for data visualization in introductory data analytics courses is matplotlib, a low-level visualization library for Python. seaborn is another popular library, built on top of matplotlib, that provides a higher-level interface for creating attractive and informative statistical graphics.

Below are some of the most popular and battle-tested data visualization libraries - all of which are free and open source:

  • matplotlib: Low-level visualization library for Python
  • seaborn: High-level visualization library for Python built on matplotlib
  • bokeh: Interactive visualizations for modern web browsers
  • plotnine: Python implementation of ggplot2
  • plotly: Interactive visualization library supporting Python, JavaScript, and R
  • altair: Declarative visualization library for Python based on Vega

🎨 Plot.ly

We’ll use plotly, which provides both low-level and high-level interfaces to create publication-ready graphs.

To use plotly in JupyterLab, install the plotly and anywidget packages in the same environment as JupyterLab, using pip in a terminal:

pip install plotly anywidget

or conda:

conda install plotly anywidget

You can use the exclamation mark ! to run shell commands directly from a Jupyter notebook cell:

!pip install plotly anywidget

or

!conda install -y plotly anywidget

It’s generally a better practice to run installation commands in a terminal rather than inside a Jupyter notebook to avoid environment issues.
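
To confirm that the packages ended up in the environment your notebook is actually running in, you can ask Python whether it can locate them. The find_missing helper below is a hypothetical name; it relies only on the standard library's importlib.util.find_spec.

```python
import importlib.util


def find_missing(packages):
    """Return the names in `packages` that cannot be imported here."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(["plotly", "anywidget"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("plotly and anywidget are both available.")
```

If a package is reported missing even though you installed it, the notebook kernel is likely using a different environment than your terminal.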


🛠️ Exercises using an HR Dataset

▶️ Import the following Python packages.

  1. pandas: Use alias pd.
  2. numpy: Use alias np.
  3. plotly.express: Use alias px.
  4. plotly.graph_objects: Use alias go.
import pandas as pd
import numpy as np

import plotly.graph_objects as go
import plotly.express as px

▶️ Check the version of plotly installed in your environment.

import plotly

print(f"Plotly version: {plotly.__version__}")
Plotly version: 6.3.1

Today, we work with an HR Dataset to uncover insights about HR metrics, measurement, and analytics. The data has been downloaded from https://rpubs.com/rhuebner/hr_codebook_v14 without any modification.

▶️ Import the HR Dataset. 🐷👧👨🏻‍🦰👩🏼‍🦳👳🏽‍♂️👩🏾‍🦲🐼.

# Display all columns
pd.set_option("display.max_columns", 50)

df_hr = pd.read_csv("https://github.com/bdi475/datasets/raw/main/HR-dataset-v14.csv")

display(df_hr)

📦 Box Plot

Box plots divide the data into four sections that each contain about 25% of the observations. They are useful for quickly identifying the distribution of the data based on Q1, Q2 (median), and Q3.

box plot explanation

▶️ Create a simple box plot of 13 GPAs. NumPy is used here to calculate the statistical figures.

gpa = np.array(
    [3.33, 2.67, 3.0, 3.67, 3.67, 2.33, 3.0, 3.0, 2.67, 4.0, 3.33, 2.67, 4.0]
)
gpa
array([3.33, 2.67, 3. , 3.67, 3.67, 2.33, 3. , 3. , 2.67, 4. , 3.33, 2.67, 4. ])
fig = px.box(x=gpa, title="GPA Distribution (Horizontal Box Plot)")
fig.show()
print(f"Mean: {np.mean(gpa)}")
print(f"Median: {np.median(gpa)}")
print(f"Q1: {np.quantile(gpa, 0.25)}")
print(f"Q3: {np.quantile(gpa, 0.75)}")
print(f"IQR: {np.quantile(gpa, 0.75) - np.quantile(gpa, 0.25)}")
Mean: 3.18
Median: 3.0
Q1: 2.67
Q3: 3.67
IQR: 1.0

🗺️ Findings

  • Median is 3.
  • Minimum is 2.33.
  • Maximum is 4.
  • Interquartile range is 1.
    • You can calculate this value by subtracting Q1 from Q3: 3.67 - 2.67.
  • There is a positive skew.
    • This is also shown by comparing the mean and the median.

🎯 Example 1: Salary box plot (vertical)

▶️ Draw a vertical box plot of Salary in df_hr.

fig = px.box(df_hr, y="Salary", title="Salary Distribution (Vertical)")
fig.show()

🎯 Example 2: Salary box plot (horizontal)

▶️ Draw a horizontal box plot of Salary.

fig = px.box(df_hr, x="Salary", title="Salary Distribution (Horizontal)")
fig.show()

🎯 Example 3: Salary distribution by citizenship status

▶️ Draw horizontal box plots of Salary by CitizenDesc.

fig = px.box(
    df_hr,
    x="Salary",
    y="CitizenDesc",
    title="Salary Distribution by Citizenship Status",
)
fig.show()

🎯 Example 4: Salary distribution by performance

▶️ Draw horizontal box plots of Salary by PerformanceScore.

# YOUR CODE BEGINS
fig = px.box(
    df_hr,
    x="Salary",
    y="PerformanceScore",
    title="Salary Distribution by Performance Score",
)
fig.show()
# YOUR CODE ENDS

🎯 Example 5: Salary distribution by department

▶️ Draw horizontal box plots of Salary by Department.

# YOUR CODE BEGINS
fig = px.box(
    df_hr,
    x="Salary",
    y="Department",
    title="Salary Distribution by Department",
    height=600,
)
fig.show()
# YOUR CODE ENDS

🧶 Histogram

Histograms display frequency distributions using bars of different heights.

Here is an example histogram showing the distribution of 500 random numbers drawn from a standard normal distribution.

fig = px.histogram(x=np.random.randn(500))
fig.show()

🎯 Example 6: Salary histogram

▶️ Draw a histogram of Salary in df_hr.

fig = px.histogram(df_hr, x="Salary", title="Salary Distribution")
fig.show()

🎯 Example 7: Number of absences histogram

▶️ Draw a histogram of Absences in df_hr.

fig = px.histogram(df_hr, x="Absences", title="Distribution of Absences")
fig.show()

🎯 Example 8: Salary histograms by gender

▶️ Draw overlaid histograms of Salary in df_hr by GenderID.

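fig = go.Figure()

# In this dataset, GenderID 1 pairs with Sex "M" and 0 with "F";
# confirm with df_hr.groupby("GenderID")["Sex"].unique().
fig.add_trace(go.Histogram(x=df_hr[df_hr["GenderID"] == 0]["Salary"], name="Female"))
fig.add_trace(go.Histogram(x=df_hr[df_hr["GenderID"] == 1]["Salary"], name="Male"))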

# Overlay both histograms
fig.update_layout(barmode="overlay")

# Reduce opacity to see both histograms
fig.update_traces(opacity=0.6)
fig.show()