
Introduction to Pandas

Pandas is a powerful and popular open-source data analysis and manipulation library for Python. It provides data structures and functions needed to work with structured data seamlessly. Pandas is built on top of NumPy, another fundamental package for numerical computing in Python, and it is widely used in data science, machine learning, and statistical analysis.


📜 History of Pandas

Pandas was created by Wes McKinney for financial analysis in 2008 and has since become one of the most essential tools for data scientists and analysts. The name “Pandas” is derived from “Panel Data,” which refers to multidimensional data sets commonly used in econometrics.

Pandas is popular for several reasons:

  1. 🗂️ Data Structures: Pandas provides two primary data structures: Series (1-dimensional) and DataFrame (2-dimensional). These structures are flexible and allow for easy manipulation of data.
  2. 🪒 Data Manipulation: Pandas offers a wide range of functions for data cleaning, transformation, and analysis. It supports operations like filtering, grouping, merging, reshaping, and handling missing data.
  3. 🔗 Integration: Pandas integrates well with other libraries in the Python ecosystem, such as NumPy, Matplotlib, and Scikit-learn, making it a versatile tool for data science workflows.
  4. 🏃🏿‍♀️ Performance: Pandas builds on NumPy's vectorized array operations, so it handles datasets that fit in memory efficiently, making it suitable for many real-world data analysis tasks.
  5. 📚 Community and Documentation: Pandas has a large and active community, which contributes to its development and provides extensive documentation and tutorials.

🚀 Getting Started with Pandas

To use Pandas, you first need to install it. The exact steps depend on your Python environment. If you’re not sure, I recommend Googling “how to install pandas in [your environment]”.

In most cases, you can install Pandas using pip:

pip install pandas

If you’re using a conda environment, you may wish to install it via conda to ensure compatibility with other packages:

conda install -c conda-forge pandas

Once installed, you can import Pandas in your Python script or Jupyter Notebook:

import pandas as pd

NumPy is another essential library for numerical computing in Python. It provides support for arrays, matrices, and mathematical functions. Pandas is built on top of NumPy and leverages its capabilities for efficient data manipulation. Therefore, it is common to import both libraries together:

import numpy as np
import pandas as pd
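As a quick sanity check after installing, the short sketch below imports both libraries and prints their versions. The exact version strings depend on your environment, so no particular output is assumed here.

```python
import numpy as np
import pandas as pd

# Print the installed versions; the values depend on your environment.
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
```

If either import fails with a ModuleNotFoundError, the package is not installed in the environment that is running your code.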

🗂️ Pandas Data Structures

Pandas provides two primary data structures: Series and DataFrame.

Series

A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, etc.). It is the basic building block of Pandas. It is similar to a column in a spreadsheet or a database table. Each element in a Series has an associated index label.

Example of creating a Series:

import numpy as np
import pandas as pd

data = [1, 2, 3, np.nan, 5, 6]
series = pd.Series(data)
print(series)
0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64
type(series)
pandas.core.series.Series

You’ll notice that the Series above looks similar to a Python list, but it has some additional features.

Additional Features of a Series

  1. 🔢 Indexing: Each element in a Series has an associated index label, which allows for easy access and manipulation of data. In the example above, the index labels are 0, 1, 2, 3, 4, 5 (RangeIndex(start=0, stop=6, step=1)).
  2. 🧩 Data Types: A Series has a single dtype shared by all of its elements. When you pass values of mixed types, Pandas picks a common dtype that can hold them all. In the example above, the integers were converted to float64 (shown by the dtype at the bottom) because np.nan is a floating-point value.
  3. 🚑 Handling Missing Data: Pandas provides built-in support for handling missing data using NaN (Not a Number) values. In the example above, np.nan represents a missing value.
  4. Vectorized Operations: You can perform operations on entire Series objects without the need for explicit loops, making it efficient for data manipulation.
  5. 📏 Broadcasting: When performing operations on Series, Pandas will automatically extend the operation to all elements or compatible dimensions.
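To make the missing-data point in the list above concrete, here is a minimal sketch that reuses the same values as the earlier Series. The variable names missing_mask and filled are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

series = pd.Series([1, 2, 3, np.nan, 5, 6])

# isna() returns a boolean Series that is True where a value is missing
missing_mask = series.isna()
print(missing_mask.sum())  # 1

# Aggregations such as sum() and mean() skip NaN by default
print(series.sum())   # 17.0
print(series.mean())  # 3.4

# fillna() returns a new Series; the original is left unchanged
filled = series.fillna(0)
print(filled[3])  # 0.0
```

Note that the mean is computed over the five non-missing values, not all six positions.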

Broadcasting Example

A Python list will not support broadcasting. For example, my_list below will repeat the entire list when you multiply it by 2:

my_list = [1, 2, 3, 4]

print(type(my_list))
display(my_list * 2)
<class 'list'>
[1, 2, 3, 4, 1, 2, 3, 4]

In a Pandas Series, however, the multiplication will be applied to each individual element:

my_series = pd.Series([1, 2, 3, 4])

print(type(my_series))
display(my_series * 2)
<class 'pandas.core.series.Series'>
0    2
1    4
2    6
3    8
dtype: int64
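The same element-wise behavior applies when both operands are Series. The sketch below uses hypothetical unit_price and quantity values (not taken from any dataset in this article) to show a Series-by-Series operation next to a scalar operation.

```python
import pandas as pd

# Hypothetical values, used only for illustration
unit_price = pd.Series([2.5, 4.0, 1.0])
quantity = pd.Series([4, 2, 10])

# Series * Series: elements are paired up by index label and multiplied
total = unit_price * quantity
print(total.tolist())  # [10.0, 8.0, 10.0]

# Series + scalar: the scalar is applied to every element
shifted = quantity + 100
print(shifted.tolist())  # [104, 102, 110]
```

No explicit loop is needed in either case.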

Creating a Series

To create a Series, you can use the pd.Series() constructor and pass in a list or array-like object:

my_series = pd.Series([10, 20, 30])

my_series
0    10
1    20
2    30
dtype: int64
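The default index labels are 0, 1, 2, and so on, but the constructor also accepts custom labels. The following sketch shows two ways to supply them; the labels "a", "b", "c" and the names labeled and from_dict are made up for this example.

```python
import pandas as pd

# Pass explicit labels with the index argument
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(labeled["b"])  # 20

# A dictionary also works: keys become the index labels
from_dict = pd.Series({"x": 1.5, "y": 2.5})
print(from_dict.index.tolist())  # ['x', 'y']
```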

Using Series methods

A Pandas Series resembles a Python list, but it also provides many methods (functions attached to the object) that you call with dot notation.

As an example, num_reviews.mean() will return the average number of reviews in the code below.

reviews_count = [12715, 2274, 2771, 3952, 528, 2766, 724]
num_reviews = pd.Series(reviews_count)

print(num_reviews.mean())
3675.714285714286

There are many other methods available for Series objects, such as sum(), min(), max(), std(), and more. You can explore the Pandas documentation for a comprehensive list of available methods.
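As a brief illustration of a few of these methods, the sketch below reuses the same review counts as above.

```python
import pandas as pd

reviews_count = [12715, 2274, 2771, 3952, 528, 2766, 724]
num_reviews = pd.Series(reviews_count)

print(num_reviews.sum())     # 25730
print(num_reviews.min())     # 528
print(num_reviews.max())     # 12715
print(num_reviews.median())  # 2766.0
```

The median (2766.0) is well below the mean (about 3675.7) because one product has far more reviews than the rest.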

Series properties

You can also access various properties of a Series object, such as dtype, index, values, and shape. These properties provide information about the data type, index labels, underlying data, and dimensions of the Series, respectively.

num_reviews.dtype
dtype('int64')
num_reviews.index
RangeIndex(start=0, stop=7, step=1)
num_reviews.values
array([12715, 2274, 2771, 3952, 528, 2766, 724], dtype=int64)
num_reviews.shape
(7,)

DataFrame

A DataFrame is a two-dimensional labeled data structure that can hold multiple Series (columns) of different data types. It is similar to a spreadsheet or a SQL table. Each column in a DataFrame is a Series, and each row represents a record. A DataFrame can be thought of as a collection of Series that share the same index.
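The relationship between a DataFrame and its columns can be checked directly. In this minimal sketch, the two-row df_demo frame is a hypothetical example; selecting one column with square brackets returns a Series that shares the DataFrame's index.

```python
import pandas as pd

# Small hypothetical frame, used only for illustration
df_demo = pd.DataFrame({
    "Account": ["Cash", "Revenue"],
    "Debit": [1000, 0],
})

# Selecting a single column returns a Series
debit = df_demo["Debit"]
print(isinstance(debit, pd.Series))  # True

# The column shares the same index as the DataFrame
print(debit.index.equals(df_demo.index))  # True
```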

Why use DataFrame?

  1. 🗂️ Tabular Data: DataFrame is ideal for representing tabular data organized into rows and labeled columns.
  2. 🧩 Heterogeneous Data: Each column can have its own data type (e.g., integers, floats, strings), making DataFrame suitable for complex datasets.
  3. 🔍 Powerful Indexing and Selection: DataFrame provides advanced indexing and selection capabilities, allowing you to easily access and manipulate subsets of your data. It supports label-based (.loc) and position-based (.iloc) indexing, as well as Boolean filtering for easy data subsetting.
  4. 📥 I/O Functionality: DataFrame provides built-in methods for reading from and writing to various file formats (e.g., CSV, Excel, SQL databases), making it easy to import and export data.
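The three selection styles named in the list above (.loc, .iloc, and Boolean filtering) can be sketched as follows. The small accounting frame here is built inline so the example runs on its own; the variable name debits is chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Account": ["Cash", "Revenue", "Rent Expense", "Cash"],
    "Debit": [1000, 0, 500, 0],
    "Credit": [0, 1000, 0, 500],
})

# Label-based: row with index label 2, column named "Account"
print(df.loc[2, "Account"])  # Rent Expense

# Position-based: first row, second column (positions start at 0)
print(df.iloc[0, 1])  # 1000

# Boolean filtering: keep only the rows where Debit is greater than 0
debits = df[df["Debit"] > 0]
print(len(debits))  # 2
```

With the default RangeIndex, labels and positions happen to coincide for rows, so the difference between .loc and .iloc matters most once the index holds custom labels.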

Create a DataFrame

There are many ways to create a DataFrame, but one of the most common methods is to use a dictionary of lists or arrays, where each key represents a column name and the corresponding value is a list or array of data for that column. Pass the dictionary to the pd.DataFrame() constructor:

# Sample accounting data
data = {
    "Date": ["2025-09-01", "2025-09-01", "2025-09-02", "2025-09-03"],
    "Account": ["Cash", "Revenue", "Rent Expense", "Cash"],
    "Debit": [1000, 0, 500, 0],
    "Credit": [0, 1000, 0, 500],
}

# Create DataFrame
df = pd.DataFrame(data)

# Display DataFrame
display(df)
print(df)
         Date       Account  Debit  Credit
0  2025-09-01          Cash   1000       0
1  2025-09-01       Revenue      0    1000
2  2025-09-02  Rent Expense    500       0
3  2025-09-03          Cash      0     500

Concise summary of a DataFrame

A common way to get a quick overview of a DataFrame is to use the .info() method. This method provides a concise summary of the DataFrame, including the number of non-null entries, data types of each column, and memory usage.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Date     4 non-null      object
 1   Account  4 non-null      object
 2   Debit    4 non-null      int64 
 3   Credit   4 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 260.0+ bytes

👉 From the result of df.info(), we can understand several things:

  • There are 4 columns.
  • 2 out of 4 columns have the object data type.
  • The second line of the output tells us the number of rows (“4 entries”).
  • No columns contain missing values.
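For contrast, here is a minimal sketch of a frame that does contain a missing value. The df_gaps data is hypothetical. Counting missing values per column with isna().sum() gives the same information as comparing the Non-Null Count column of info() against the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Debit value
df_gaps = pd.DataFrame({
    "Account": ["Cash", "Revenue", "Rent Expense"],
    "Debit": [1000, np.nan, 500],
})

# Count missing values in each column
missing_per_column = df_gaps.isna().sum()
print(missing_per_column["Account"])  # 0
print(missing_per_column["Debit"])    # 1

# The NaN forces the Debit column to a floating-point dtype
print(df_gaps["Debit"].dtype)  # float64
```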

Creating a DataFrame from a CSV file

You can also create a DataFrame by reading data from a CSV file using the pd.read_csv() function. This is a common way to load data into a Pandas DataFrame for analysis. This not only supports local files but also files from a URL. In the code below, we read a CSV file from a URL and create a DataFrame from it.

This is more practical than creating a DataFrame from scratch, as real-world data is often stored in files.

df_products = pd.read_csv(
    "https://raw.githubusercontent.com/bdi475/datasets/main/maven-toys-data/products.csv"
)

display(df_products)
df_products.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Product_ID        35 non-null     int64  
 1   Product_Name      35 non-null     object 
 2   Product_Category  35 non-null     object 
 3   Product_Cost      35 non-null     float64
 4   Product_Price     35 non-null     float64
dtypes: float64(2), int64(1), object(2)
memory usage: 1.5+ KB
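If you want to try pd.read_csv() without network access, it also accepts a file-like object. The sketch below wraps a short CSV string in io.StringIO; the two rows mirror the first two products in the dataset above, and the three-column layout is a simplification chosen for this example.

```python
import io
import pandas as pd

# A tiny CSV held in memory instead of a file or URL
csv_text = """Product_ID,Product_Name,Product_Price
1,Action Figure,15.99
2,Animal Figures,12.99
"""

df_small = pd.read_csv(io.StringIO(csv_text))
print(df_small.shape)  # (2, 3)

# to_csv() with no path returns the CSV as a string;
# index=False leaves out the row index column
round_trip = df_small.to_csv(index=False)
print(round_trip.splitlines()[0])  # Product_ID,Product_Name,Product_Price
```

Passing a file path to to_csv() instead writes the same content to disk.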

Using DataFrame methods

There are many methods available for DataFrame objects, such as head(), tail(), describe(), info(), and more. You can explore the Pandas documentation for a comprehensive list of available methods.

Display first few rows of a DataFrame

df_products.head()

Display last few rows of a DataFrame

df_products.tail()
df_products.tail(2)

Sample random rows from a DataFrame

# randomly sample one row from the DataFrame
df_products.sample()
# randomly sample three rows from the DataFrame
df_products.sample(3)
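The following offline sketch puts these methods side by side and adds describe(), which is mentioned above but not demonstrated. The df_demo frame is a hypothetical stand-in for df_products that reuses the first ten product prices; the random_state argument is optional and is used here only to make sampling repeatable.

```python
import pandas as pd

# Hypothetical stand-in for df_products so the example runs offline
df_demo = pd.DataFrame({
    "Product_ID": range(1, 11),
    "Product_Price": [15.99, 12.99, 3.99, 12.99, 9.99,
                      14.99, 15.99, 6.99, 10.99, 14.99],
})

first_three = df_demo.head(3)  # head() and tail() default to 5 rows
last_two = df_demo.tail(2)
print(last_two["Product_ID"].tolist())  # [9, 10]

# random_state fixes the seed, so the same rows come back on every run
sample_a = df_demo.sample(3, random_state=42)
sample_b = df_demo.sample(3, random_state=42)
print(sample_a.equals(sample_b))  # True

# describe() summarizes numeric columns (count, mean, std, min, quartiles, max)
summary = df_demo.describe()
print(summary.loc["count", "Product_Price"])  # 10.0
```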

DataFrame properties

You can also access various properties of a DataFrame object, such as dtypes, index, columns, values, and shape. These properties provide information about the data type of each column, the row index labels, the column labels, the underlying data as a NumPy array, and the (rows, columns) dimensions of the DataFrame, respectively.

df_products.dtypes
Product_ID            int64
Product_Name         object
Product_Category     object
Product_Cost        float64
Product_Price       float64
dtype: object
df_products.index
RangeIndex(start=0, stop=35, step=1)
df_products.values
array([[1, 'Action Figure', 'Toys', 9.99, 15.99], [2, 'Animal Figures', 'Toys', 9.99, 12.99], [3, "Barrel O' Slime", 'Art & Crafts', 1.99, 3.99], [4, 'Chutes & Ladders', 'Games', 9.99, 12.99], [5, 'Classic Dominoes', 'Games', 7.99, 9.99], [6, 'Colorbuds', 'Electronics', 6.99, 14.99], [7, 'Dart Gun', 'Sports & Outdoors', 11.99, 15.99], [8, 'Deck Of Cards', 'Games', 3.99, 6.99], [9, 'Dino Egg', 'Toys', 9.99, 10.99], [10, 'Dinosaur Figures', 'Toys', 10.99, 14.99], [11, 'Etch A Sketch', 'Art & Crafts', 10.99, 20.99], [12, 'Foam Disk Launcher', 'Sports & Outdoors', 8.99, 11.99], [13, 'Gamer Headphones', 'Electronics', 14.99, 20.99], [14, 'Glass Marbles', 'Games', 5.99, 10.99], [15, 'Hot Wheels 5-Pack', 'Toys', 3.99, 5.99], [16, 'Jenga', 'Games', 2.99, 9.99], [17, 'Kids Makeup Kit', 'Art & Crafts', 13.99, 19.99], [18, 'Lego Bricks', 'Toys', 34.99, 39.99], [19, 'Magic Sand', 'Art & Crafts', 13.99, 15.99], [20, 'Mini Basketball Hoop', 'Sports & Outdoors', 8.99, 24.99], [21, 'Mini Ping Pong Set', 'Sports & Outdoors', 6.99, 9.99], [22, 'Monopoly', 'Games', 13.99, 19.99], [23, 'Mr. Potatohead', 'Toys', 4.99, 9.99], [24, 'Nerf Gun', 'Sports & Outdoors', 14.99, 19.99], [25, 'PlayDoh Can', 'Art & Crafts', 1.99, 2.99], [26, 'PlayDoh Playset', 'Art & Crafts', 20.99, 24.99], [27, 'PlayDoh Toolkit', 'Art & Crafts', 3.99, 4.99], [28, 'Playfoam', 'Art & Crafts', 3.99, 10.99], [29, 'Plush Pony', 'Toys', 8.99, 19.99], [30, "Rubik's Cube", 'Games', 17.99, 19.99], [31, 'Splash Balls', 'Sports & Outdoors', 7.99, 8.99], [32, 'Supersoaker Water Gun', 'Sports & Outdoors', 11.99, 14.99], [33, 'Teddy Bear', 'Toys', 10.99, 12.99], [34, 'Toy Robot', 'Electronics', 20.99, 25.99], [35, 'Uno Card Game', 'Games', 3.99, 7.99]], dtype=object)
df_products.shape
(35, 5)
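The columns property is not shown in the outputs above, so here is a minimal sketch of it together with unpacking shape into separate row and column counts. The three-row df_demo frame is a hypothetical stand-in built from the first three products.

```python
import pandas as pd

# Hypothetical stand-in for df_products
df_demo = pd.DataFrame({
    "Product_ID": [1, 2, 3],
    "Product_Name": ["Action Figure", "Animal Figures", "Barrel O' Slime"],
    "Product_Price": [15.99, 12.99, 3.99],
})

# columns holds the column labels; tolist() converts them to a plain list
print(df_demo.columns.tolist())  # ['Product_ID', 'Product_Name', 'Product_Price']

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = df_demo.shape
print(n_rows, n_cols)  # 3 3
```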