Introduction to Pandas

Pandas is a powerful and popular open-source data analysis and manipulation library for Python. It provides data structures and functions needed to work with structured data seamlessly. Pandas is built on top of NumPy, another fundamental package for numerical computing in Python, and it is widely used in data science, machine learning, and statistical analysis.

📜 History of Pandas

Pandas was created by Wes McKinney for financial analysis in 2008 and has since become one of the most essential tools for data scientists and analysts. The name “Pandas” is derived from “Panel Data,” which refers to multidimensional data sets commonly used in econometrics.

Pandas is popular for several reasons:

🗂️ Data Structures: Pandas provides two primary data structures: Series (1-dimensional) and DataFrame (2-dimensional). These structures are flexible and allow for easy manipulation of data.
🪒 Data Manipulation: Pandas offers a wide range of functions for data cleaning, transformation, and analysis. It supports operations like filtering, grouping, merging, reshaping, and handling missing data.
🔗 Integration: Pandas integrates well with other libraries in the Python ecosystem, such as NumPy, Matplotlib, and Scikit-learn, making it a versatile tool for data science workflows.
🏃🏿‍♀️ Performance: Many Pandas operations are vectorized on top of NumPy, so it handles datasets that fit in memory efficiently, making it suitable for real-world data analysis tasks.
📚 Community and Documentation: Pandas has a large and active community, which contributes to its development and provides extensive documentation and tutorials.

🚀 Getting Started with Pandas

To use Pandas, you first need to install it. The exact steps depend on your Python environment. If you’re not sure, I recommend searching for “how to install pandas in [your environment]”.

In most cases, you can install Pandas using pip:

pip install pandas

If you’re using a conda environment, you may wish to install it via conda to ensure compatibility with other packages:

conda install -c conda-forge pandas

What is the conda-forge channel?

If you have used conda before, you may have noticed that it has multiple channels from which you can install packages. The default channel is maintained by Anaconda, but their release cycle can lag behind the latest versions of some packages.

There are many community-driven channels that offer faster updates and broader coverage of packages. conda-forge is one of the most popular community channels and is known for its extensive collection of packages and timely updates. It is maintained by a large community of contributors who work to keep packages up-to-date and compatible with each other.

When you specify -c conda-forge in conda install -c conda-forge, you are telling conda to prioritize the conda-forge channel when searching for the package to install. This can be particularly useful if you want to ensure that you are getting the latest version of a package or if a package is not available in the default channel.

Once installed, you can import Pandas in your Python script or Jupyter Notebook:

import pandas as pd

NumPy is another essential library for numerical computing in Python. It provides support for arrays, matrices, and mathematical functions. Pandas is built on top of NumPy and leverages its capabilities for efficient data manipulation. Therefore, it is common to import both libraries together:

import numpy as np
import pandas as pd
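A quick way to confirm that the installation worked is to print the installed versions. This is a minimal sketch; the version numbers you see will depend on your environment.

```python
import numpy as np
import pandas as pd

# Print the installed versions to confirm that both imports succeeded.
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
```

If either import fails with ModuleNotFoundError, the package is not installed in the environment your script or notebook is running in.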

🗂️ Pandas Data Structures

Pandas provides two primary data structures: Series and DataFrame.

`Series`

A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, etc.). It is the basic building block of Pandas. It is similar to a column in a spreadsheet or a database table. Each element in a Series has an associated index label.

Example of creating a Series:

import numpy as np
import pandas as pd

data = [1, 2, 3, np.nan, 5, 6]
series = pd.Series(data)
print(series)

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64

type(series)

pandas.core.series.Series

You’ll notice that the Series above looks similar to a Python list, but it has some additional features.

Additional Features of a `Series`

🔢 Indexing: Each element in a Series has an associated index label, which allows for easy access and manipulation of data. In the example above, the index labels are 0, 1, 2, 3, 4, 5 (RangeIndex(start=0, stop=6, step=1)).
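🧩 Data Types: A Series can hold any data type, but each Series has a single dtype that all of its elements share. In the example above, the integers were converted to float64 (shown by the dtype at the bottom) because NaN is a floating-point value. If you mix incompatible types, such as numbers and strings, Pandas falls back to the generic object dtype.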
🚑 Handling Missing Data: Pandas provides built-in support for handling missing data using NaN (Not a Number) values. In the example above, np.nan represents a missing value.
⚡ Vectorized Operations: You can perform operations on entire Series objects without the need for explicit loops, making it efficient for data manipulation.
📏 Broadcasting: When performing operations on Series, Pandas will automatically extend the operation to all elements or compatible dimensions.
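The first four features above can be sketched in a few lines. The sales figures and the day labels used as the index are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales figures; the value for "wed" is deliberately missing.
sales = pd.Series([120.0, 95.5, np.nan, 143.25], index=["mon", "tue", "wed", "thu"])

# Indexing: look up a value by its index label.
tuesday = sales["tue"]

# Data types: the whole Series shares one dtype.
sales_dtype = sales.dtype

# Handling missing data: count the missing entries.
num_missing = int(sales.isna().sum())

# Vectorized operation: scale every value without writing a loop.
scaled = sales * 1.1

print(tuesday, sales_dtype, num_missing)
print(scaled)
```

Note that the missing value stays missing after the multiplication, since arithmetic on NaN produces NaN.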

Broadcasting Example

A Python list does not support broadcasting. For example, multiplying my_list below by 2 repeats the entire list:

my_list = [1, 2, 3, 4]

print(type(my_list))
display(my_list * 2)

<class 'list'>

[1, 2, 3, 4, 1, 2, 3, 4]

In a Pandas Series, however, the multiplication will be applied to each individual element:

my_series = pd.Series([1, 2, 3, 4])

print(type(my_series))
display(my_series * 2)

<class 'pandas.core.series.Series'>

0    2
1    4
2    6
3    8
dtype: int64
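To get the same element-wise result from a plain list, you would need an explicit loop or a list comprehension. The comparison below is a small sketch of that difference.

```python
import pandas as pd

my_list = [1, 2, 3, 4]

# With a plain list, element-wise doubling needs a loop or a comprehension.
doubled_list = [x * 2 for x in my_list]

# With a Series, the scalar is applied to each element automatically.
doubled_series = pd.Series(my_list) * 2

print(doubled_list)
print(doubled_series.tolist())
```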

Creating a `Series`

To create a Series, you can use the pd.Series() constructor and pass in a list or array-like object:

my_series = pd.Series([10, 20, 30])

my_series

0    10
1    20
2    30
dtype: int64
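Besides a plain list, the constructor also accepts other inputs. The sketch below shows a NumPy array, a list with explicit index labels, and a dictionary. The labels and the name "price" are illustrative, not taken from any dataset.

```python
import numpy as np
import pandas as pd

# From a NumPy array (an array-like object).
from_array = pd.Series(np.array([10, 20, 30]))

# From a list, with explicit index labels and a name.
from_list = pd.Series([10, 20, 30], index=["a", "b", "c"], name="price")

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({"a": 10, "b": 20, "c": 30})

print(from_array)
print(from_list)
print(from_dict)
```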

Using `Series` methods

A Pandas Series is similar to a Python list. However, a Series also provides many methods (functions attached to the object) for you to use.

As an example, num_reviews.mean() will return the average number of reviews in the code below.

reviews_count = [12715, 2274, 2771, 3952, 528, 2766, 724]
num_reviews = pd.Series(reviews_count)

print(num_reviews.mean())

3675.714285714286

There are many other methods available for Series objects, such as sum(), min(), max(), std(), and more. You can explore the Pandas documentation for a comprehensive list of available methods.
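As a short sketch, here are a few of those methods applied to the same review counts used above.

```python
import pandas as pd

reviews_count = [12715, 2274, 2771, 3952, 528, 2766, 724]
num_reviews = pd.Series(reviews_count)

print(num_reviews.sum())     # total number of reviews
print(num_reviews.min())     # smallest value
print(num_reviews.max())     # largest value
print(num_reviews.median())  # middle value when sorted
print(num_reviews.std())     # sample standard deviation (the default)

# describe() bundles several summary statistics into one Series.
print(num_reviews.describe())
```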

`Series` properties

You can also access various properties of a Series object, such as dtype, index, values, and shape. These properties provide information about the data type, index labels, underlying data, and dimensions of the Series, respectively.

num_reviews.dtype

dtype('int64')

num_reviews.index

RangeIndex(start=0, stop=7, step=1)

num_reviews.values

array([12715, 2274, 2771, 3952, 528, 2766, 724], dtype=int64)

num_reviews.shape

(7,)

`DataFrame`

A DataFrame is a two-dimensional labeled data structure that can hold multiple Series (columns) of different data types. It is similar to a spreadsheet or a SQL table. Each column in a DataFrame is a Series, and each row represents a record. DataFrame can be thought of as a collection of Series that share the same index.

Why use `DataFrame`?

🗂️ Tabular Data: DataFrame is ideal for representing tabular data, where each column can have a different data type (e.g., integers, floats, strings).
🧩 Heterogeneous Data: DataFrame can hold different data types in each column, making it suitable for complex datasets.
🔍 Powerful Indexing and Selection: DataFrame provides advanced indexing and selection capabilities, allowing you to easily access and manipulate subsets of your data. It supports label-based (.loc) and position-based (.iloc) indexing, as well as Boolean filtering for easy data subsetting.
📥 I/O Functionality: DataFrame provides built-in methods for reading from and writing to various file formats (e.g., CSV, Excel, SQL databases), making it easy to import and export data.
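The indexing and selection styles mentioned above can be sketched on a tiny table. The row labels "t1" to "t3" and the values are made up for this example.

```python
import pandas as pd

# Small illustrative table with custom row labels.
df = pd.DataFrame(
    {
        "Account": ["Cash", "Revenue", "Rent Expense"],
        "Debit": [1000, 0, 500],
        "Credit": [0, 1000, 0],
    },
    index=["t1", "t2", "t3"],
)

# Label-based: the row labeled "t2", column "Credit".
by_label = df.loc["t2", "Credit"]

# Position-based: first row, second column (the "Debit" column).
by_position = df.iloc[0, 1]

# Boolean filtering: keep only rows where Debit is greater than zero.
debits_only = df[df["Debit"] > 0]

print(by_label, by_position)
print(debits_only)
```

The filter keeps rows "t1" and "t3", since those are the only rows with a non-zero debit.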

Create a `DataFrame`

There are many ways to create a DataFrame, but one of the most common methods is to use a dictionary of lists or arrays, where each key represents a column name and the corresponding value is a list or array of data for that column. Pass the dictionary to the pd.DataFrame() constructor:

# Sample accounting data
data = {
    "Date": ["2025-09-01", "2025-09-01", "2025-09-02", "2025-09-03"],
    "Account": ["Cash", "Revenue", "Rent Expense", "Cash"],
    "Debit": [1000, 0, 500, 0],
    "Credit": [0, 1000, 0, 500],
}

# Create DataFrame
df = pd.DataFrame(data)

# Display DataFrame
display(df)

print(df)

         Date       Account  Debit  Credit
0  2025-09-01          Cash   1000       0
1  2025-09-01       Revenue      0    1000
2  2025-09-02  Rent Expense    500       0
3  2025-09-03          Cash      0     500
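Another common way to build a DataFrame is from a list of dictionaries, where each dictionary is one row. This sketch reuses the first two rows of the accounting data above.

```python
import pandas as pd

# Each dictionary is one row; the keys become column names.
records = [
    {"Date": "2025-09-01", "Account": "Cash", "Debit": 1000, "Credit": 0},
    {"Date": "2025-09-01", "Account": "Revenue", "Debit": 0, "Credit": 1000},
]
df_from_records = pd.DataFrame(records)

print(df_from_records)
print(df_from_records.shape)
```

The dictionary-of-lists form is convenient when your data is organized by column, while the list-of-dictionaries form fits data that arrives one record at a time.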

Concise summary of a `DataFrame`

A common way to get a quick overview of a DataFrame is to use the .info() method. This method provides a concise summary of the DataFrame, including the number of non-null entries, data types of each column, and memory usage.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Date     4 non-null      object
 1   Account  4 non-null      object
 2   Debit    4 non-null      int64 
 3   Credit   4 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 260.0+ bytes

👉 From the result of df.info(), we can learn several things:

There are 4 columns.
2 out of 4 columns have the object data type.
The second line of the output tells us the number of rows (“4 entries”).
No columns contain missing values.
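The same facts can be checked programmatically. The sketch below rebuilds the accounting DataFrame and confirms the row count, column count, and absence of missing values.

```python
import pandas as pd

data = {
    "Date": ["2025-09-01", "2025-09-01", "2025-09-02", "2025-09-03"],
    "Account": ["Cash", "Revenue", "Rent Expense", "Cash"],
    "Debit": [1000, 0, 500, 0],
    "Credit": [0, 1000, 0, 500],
}
df = pd.DataFrame(data)

# Number of rows and number of columns.
print(len(df), len(df.columns))

# Missing values per column (all zeros for this data).
missing_per_column = df.isna().sum()
print(missing_per_column)
```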

Creating a `DataFrame` from a CSV file

You can also create a DataFrame by reading data from a CSV file using the pd.read_csv() function. This is a common way to load data into a Pandas DataFrame for analysis. The function accepts local file paths as well as URLs. In the code below, we read a CSV file from a URL and create a DataFrame from it.

This is more practical than creating a DataFrame from scratch, as real-world data is often stored in files.

df_products = pd.read_csv(
    "https://raw.githubusercontent.com/bdi475/datasets/main/maven-toys-data/products.csv"
)

display(df_products)

df_products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Product_ID        35 non-null     int64  
 1   Product_Name      35 non-null     object 
 2   Product_Category  35 non-null     object 
 3   Product_Cost      35 non-null     float64
 4   Product_Price     35 non-null     float64
dtypes: float64(2), int64(1), object(2)
memory usage: 1.5+ KB
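If you want to try pd.read_csv() without a network connection or a file on disk, you can pass it a file-like object instead. This is a sketch using made-up CSV text; the product names and prices are placeholders, not rows from the dataset above.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk.
csv_text = """Product_ID,Product_Name,Product_Price
1,Sample Item A,15.99
2,Sample Item B,12.99
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
df_sample = pd.read_csv(io.StringIO(csv_text))

print(df_sample)
print(df_sample.dtypes)
```

The first line of the text is treated as the header row, and Pandas infers a data type for each column from its values.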

Using `DataFrame` methods

There are many methods available for DataFrame objects, such as head(), tail(), describe(), info(), and more. You can explore the Pandas documentation for a comprehensive list of available methods.

Display first few rows of a `DataFrame`

df_products.head()

Display last few rows of a `DataFrame`

df_products.tail()

df_products.tail(2)

Sample random rows from a `DataFrame`

# randomly sample one row from the DataFrame
df_products.sample()

# randomly sample three rows from the DataFrame
df_products.sample(3)
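The three methods above can be tried on a small stand-in table. This sketch uses a made-up ten-row DataFrame named df_demo instead of df_products so that it runs without downloading anything. It also passes random_state to sample(), which fixes the seed so the same rows are returned on every run.

```python
import pandas as pd

# Illustrative ten-row DataFrame standing in for df_products.
df_demo = pd.DataFrame({"id": range(1, 11), "value": [i * 10 for i in range(1, 11)]})

first_three = df_demo.head(3)  # first 3 rows instead of the default 5
last_two = df_demo.tail(2)     # last 2 rows instead of the default 5

# Two samples drawn with the same seed contain the same rows.
sample_a = df_demo.sample(3, random_state=42)
sample_b = df_demo.sample(3, random_state=42)

print(first_three)
print(last_two)
print(sample_a)
```

Without random_state, each call to sample() may return different rows.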

`DataFrame` properties

You can also access various properties of a DataFrame object, such as dtypes, index, columns, and shape. These properties provide information about the data types, index labels, column labels, and the shape of the DataFrame, respectively.

df_products.dtypes

Product_ID            int64
Product_Name         object
Product_Category     object
Product_Cost        float64
Product_Price       float64
dtype: object

df_products.index

RangeIndex(start=0, stop=35, step=1)

df_products.values

array([[1, 'Action Figure', 'Toys', 9.99, 15.99],
       [2, 'Animal Figures', 'Toys', 9.99, 12.99],
       [3, "Barrel O' Slime", 'Art & Crafts', 1.99, 3.99],
       [4, 'Chutes & Ladders', 'Games', 9.99, 12.99],
       [5, 'Classic Dominoes', 'Games', 7.99, 9.99],
       [6, 'Colorbuds', 'Electronics', 6.99, 14.99],
       [7, 'Dart Gun', 'Sports & Outdoors', 11.99, 15.99],
       [8, 'Deck Of Cards', 'Games', 3.99, 6.99],
       [9, 'Dino Egg', 'Toys', 9.99, 10.99],
       [10, 'Dinosaur Figures', 'Toys', 10.99, 14.99],
       [11, 'Etch A Sketch', 'Art & Crafts', 10.99, 20.99],
       [12, 'Foam Disk Launcher', 'Sports & Outdoors', 8.99, 11.99],
       [13, 'Gamer Headphones', 'Electronics', 14.99, 20.99],
       [14, 'Glass Marbles', 'Games', 5.99, 10.99],
       [15, 'Hot Wheels 5-Pack', 'Toys', 3.99, 5.99],
       [16, 'Jenga', 'Games', 2.99, 9.99],
       [17, 'Kids Makeup Kit', 'Art & Crafts', 13.99, 19.99],
       [18, 'Lego Bricks', 'Toys', 34.99, 39.99],
       [19, 'Magic Sand', 'Art & Crafts', 13.99, 15.99],
       [20, 'Mini Basketball Hoop', 'Sports & Outdoors', 8.99, 24.99],
       [21, 'Mini Ping Pong Set', 'Sports & Outdoors', 6.99, 9.99],
       [22, 'Monopoly', 'Games', 13.99, 19.99],
       [23, 'Mr. Potatohead', 'Toys', 4.99, 9.99],
       [24, 'Nerf Gun', 'Sports & Outdoors', 14.99, 19.99],
       [25, 'PlayDoh Can', 'Art & Crafts', 1.99, 2.99],
       [26, 'PlayDoh Playset', 'Art & Crafts', 20.99, 24.99],
       [27, 'PlayDoh Toolkit', 'Art & Crafts', 3.99, 4.99],
       [28, 'Playfoam', 'Art & Crafts', 3.99, 10.99],
       [29, 'Plush Pony', 'Toys', 8.99, 19.99],
       [30, "Rubik's Cube", 'Games', 17.99, 19.99],
       [31, 'Splash Balls', 'Sports & Outdoors', 7.99, 8.99],
       [32, 'Supersoaker Water Gun', 'Sports & Outdoors', 11.99, 14.99],
       [33, 'Teddy Bear', 'Toys', 10.99, 12.99],
       [34, 'Toy Robot', 'Electronics', 20.99, 25.99],
       [35, 'Uno Card Game', 'Games', 3.99, 7.99]], dtype=object)

df_products.shape

(35, 5)
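The columns property mentioned above works the same way as the others. Here is a minimal sketch on a two-column stand-in table with illustrative values.

```python
import pandas as pd

# Small stand-in table with two columns.
df_small = pd.DataFrame({"Product_ID": [1, 2], "Product_Price": [15.99, 12.99]})

# .columns is an Index of column labels.
print(df_small.columns)

# Convert it to a plain Python list when needed.
column_names = df_small.columns.tolist()
print(column_names)
```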

.shape property is a tuple

The .shape property of a DataFrame returns a tuple representing the dimensions of the DataFrame. It provides the number of rows and columns in the DataFrame.

A tuple is an ordered collection of items that is immutable (cannot be changed after creation). You can think of it as a fixed-size list that cannot be updated once created. In the case of the .shape property, it returns a tuple with two elements: the first element is the number of rows, and the second element is the number of columns.

df_products.shape[0] # number of rows
df_products.shape[1] # number of columns

num_rows, num_cols = df_products.shape # unpacking the tuple
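The following sketch ties these points together on a small made-up table: it unpacks the shape, relates it to len() and size, and shows that the tuple cannot be modified.

```python
import pandas as pd

# Illustrative table with 3 rows and 2 columns.
df_small = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

num_rows, num_cols = df_small.shape  # unpack the tuple

print(num_rows, num_cols)
print(len(df_small) == num_rows)             # len() gives the row count
print(df_small.size == num_rows * num_cols)  # size is the total number of cells

# Tuples are immutable, so assigning to an element raises TypeError.
try:
    df_small.shape[0] = 10
except TypeError as exc:
    print("TypeError:", exc)
```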
