Pandas is a powerful and popular open-source data analysis and manipulation library for Python. It provides data structures and functions needed to work with structured data seamlessly. Pandas is built on top of NumPy, another fundamental package for numerical computing in Python, and it is widely used in data science, machine learning, and statistical analysis.

📜 History of Pandas
Pandas was created by Wes McKinney for financial analysis in 2008 and has since become one of the most essential tools for data scientists and analysts. The name “Pandas” is derived from “Panel Data,” which refers to multidimensional data sets commonly used in econometrics.
Pandas is popular for several reasons:
- 🗂️ Data Structures: Pandas provides two primary data structures: Series (1-dimensional) and DataFrame (2-dimensional). These structures are flexible and allow for easy manipulation of data.
- 🪒 Data Manipulation: Pandas offers a wide range of functions for data cleaning, transformation, and analysis. It supports operations like filtering, grouping, merging, reshaping, and handling missing data.
- 🔗 Integration: Pandas integrates well with other libraries in the Python ecosystem, such as NumPy, Matplotlib, and Scikit-learn, making it a versatile tool for data science workflows.
- 🏃🏿‍♀️ Performance: Many Pandas operations are vectorized and run in compiled code through NumPy, so Pandas can efficiently handle datasets that fit in memory, making it suitable for real-world data analysis tasks.
- 📚 Community and Documentation: Pandas has a large and active community, which contributes to its development and provides extensive documentation and tutorials.
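As a rough illustration of the data manipulation operations listed above, the sketch below fills a missing value, filters rows, and groups a tiny table. The `sales` table, its column names, and its values are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records (invented for illustration)
sales = pd.DataFrame(
    {
        "region": ["East", "West", "East", "West"],
        "units": [10, np.nan, 7, 3],
    }
)

# Handling missing data: treat the missing unit count as zero
sales["units"] = sales["units"].fillna(0)

# Filtering: keep only rows with at least 5 units
large_orders = sales[sales["units"] >= 5]

# Grouping: total units per region
units_by_region = sales.groupby("region")["units"].sum()

print(large_orders)
print(units_by_region)
```

Whether zero is the right replacement for a missing value depends on the data; it is used here only to keep the example short.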
🚀 Getting Started with Pandas
To use Pandas, you first need to install it. The exact steps depend on your Python environment. If you’re not sure, I recommend searching for “how to install pandas in [your environment]”.
In most cases, you can install Pandas using pip:
pip install pandas
If you’re using a conda environment, you may wish to install it via conda to ensure compatibility with other packages:
conda install -c conda-forge pandas
Once installed, you can import Pandas in your Python script or Jupyter Notebook:
import pandas as pd
NumPy is another essential library for numerical computing in Python. It provides support for arrays, matrices, and mathematical functions. Pandas is built on top of NumPy and leverages its capabilities for efficient data manipulation. Therefore, it is common to import both libraries together:
import numpy as np
import pandas as pd
🗂️ Pandas Data Structures
Pandas provides two primary data structures: Series and DataFrame.
Series
A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, etc.). It is the basic building block of Pandas. It is similar to a column in a spreadsheet or a database table. Each element in a Series has an associated index label.
Example of creating a Series:
import numpy as np  # needed for np.nan below
import pandas as pd
data = [1, 2, 3, np.nan, 5, 6]
series = pd.Series(data)
print(series)

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64

type(series)

pandas.core.series.Series

You’ll notice that the Series above looks similar to a Python list, but it has some additional features.
Additional Features of a Series
- 🔢 Indexing: Each element in a Series has an associated index label, which allows for easy access and manipulation of data. In the example above, the index labels are 0, 1, 2, 3, 4, 5 (RangeIndex(start=0, stop=6, step=1)).
- 🧩 Data Types: A Series can hold any data type, but each Series has a single dtype that applies to all of its elements. In the example above, all elements are float64, as indicated by the dtype shown at the bottom of the output.
- 🚑 Handling Missing Data: Pandas provides built-in support for handling missing data using NaN (Not a Number) values. In the example above, np.nan represents a missing value.
- ⚡ Vectorized Operations: You can perform operations on entire Series objects without the need for explicit loops, which makes data manipulation efficient.
- 📏 Broadcasting: When performing operations on a Series, Pandas automatically extends the operation to all elements or compatible dimensions. For example, multiplying a Series by a single number multiplies every element.
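The features listed above can be tried out in a few lines. The following is a minimal sketch; the `temps` Series, its index labels, and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures in Celsius with custom index labels
temps = pd.Series([21.5, np.nan, 19.0], index=["mon", "tue", "wed"])

# Indexing: access an element by its label
print(temps["mon"])        # 21.5

# Data types: the whole Series shares one dtype
print(temps.dtype)         # float64

# Handling missing data: count NaN values, then fill them with the mean
print(temps.isna().sum())  # 1
filled = temps.fillna(temps.mean())

# Vectorized operation: convert every value to Fahrenheit without a loop
fahrenheit = filled * 9 / 5 + 32
print(fahrenheit)
```

Filling with the mean is only one possible strategy for missing values; dropping them with `dropna()` is another common choice.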
Broadcasting Example
A Python list does not support broadcasting. For example, multiplying my_list below by 2 repeats the entire list:
my_list = [1, 2, 3, 4]
print(type(my_list))
display(my_list * 2)  # display() is available in Jupyter/IPython

<class 'list'>
[1, 2, 3, 4, 1, 2, 3, 4]

In a Pandas Series, however, the multiplication is applied to each individual element:
my_series = pd.Series([1, 2, 3, 4])
print(type(my_series))
display(my_series * 2)

<class 'pandas.core.series.Series'>
0    2
1    4
2    6
3    8
dtype: int64

Creating a Series
To create a Series, you can use the pd.Series() constructor and pass in a list or array-like object:
my_series = pd.Series([10, 20, 30])
my_series

0    10
1    20
2    30
dtype: int64

Using Series methods
A Pandas Series is similar to a Python list. However, a Series provides many methods (functions that belong to the object) for you to use.
As an example, num_reviews.mean() will return the average number of reviews in the code below.
reviews_count = [12715, 2274, 2771, 3952, 528, 2766, 724]
num_reviews = pd.Series(reviews_count)
print(num_reviews.mean())

3675.714285714286
There are many other methods available for Series objects, such as sum(), min(), max(), std(), and more. You can explore the Pandas documentation for a comprehensive list of available methods.
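As a quick sketch of the other methods mentioned above, the block below reuses the same review counts. `median()` is included as one more example of a summary method.

```python
import pandas as pd

# Same review counts as in the example above
num_reviews = pd.Series([12715, 2274, 2771, 3952, 528, 2766, 724])

print(num_reviews.sum())     # 25730
print(num_reviews.min())     # 528
print(num_reviews.max())     # 12715
print(num_reviews.median())  # 2766.0
print(num_reviews.std())     # sample standard deviation (ddof=1 by default)
```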
Series properties
You can also access various properties of a Series object, such as dtype, index, values, and shape. These properties provide information about the data type, index labels, underlying data, and dimensions of the Series, respectively.
num_reviews.dtype

dtype('int64')

num_reviews.index

RangeIndex(start=0, stop=7, step=1)

num_reviews.values

array([12715, 2274, 2771, 3952, 528, 2766, 724], dtype=int64)

num_reviews.shape

(7,)

DataFrame
A DataFrame is a two-dimensional labeled data structure that can hold multiple Series (columns) of different data types. It is similar to a spreadsheet or a SQL table. Each column in a DataFrame is a Series, and each row represents a record. A DataFrame can be thought of as a collection of Series that share the same index.
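To make the "collection of Series" idea concrete, here is a small sketch. The variable names, column names, and index labels are assumptions chosen for this illustration.

```python
import pandas as pd

# Two hypothetical Series that share the same index labels
names = pd.Series(["Action Figure", "Animal Figures", "Dino Egg"], index=["a", "b", "c"])
prices = pd.Series([15.99, 12.99, 10.99], index=["a", "b", "c"])

# Combine them into a DataFrame; each Series becomes one column
toys = pd.DataFrame({"name": names, "price": prices})

# Selecting a single column gives back a Series
price_column = toys["price"]
print(type(price_column))

# The column carries the same index as the DataFrame
print(price_column.index.equals(toys.index))  # True
```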
Why use DataFrame?
- 🗂️ Tabular Data: DataFrame is ideal for representing tabular data, where each column can have a different data type (e.g., integers, floats, strings).
- 🧩 Heterogeneous Data: DataFrame can hold different data types in each column, making it suitable for complex datasets.
- 🔍 Powerful Indexing and Selection: DataFrame provides advanced indexing and selection capabilities, allowing you to easily access and manipulate subsets of your data. It supports label-based (.loc) and position-based (.iloc) indexing, as well as Boolean filtering for easy data subsetting.
- 📥 I/O Functionality: Pandas provides built-in functions (such as pd.read_csv()) and DataFrame methods (such as to_csv()) for reading from and writing to various file formats (e.g., CSV, Excel, SQL databases), making it easy to import and export data.
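The indexing options mentioned above can be sketched as follows. The `ledger` table and its "t1" to "t3" index labels are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical ledger with custom row labels
ledger = pd.DataFrame(
    {"Account": ["Cash", "Revenue", "Rent Expense"], "Debit": [1000, 0, 500]},
    index=["t1", "t2", "t3"],
)

# Label-based selection with .loc: row "t2", column "Account"
print(ledger.loc["t2", "Account"])  # Revenue

# Position-based selection with .iloc: first row, second column
print(ledger.iloc[0, 1])            # 1000

# Boolean filtering: keep rows with a nonzero debit
debits = ledger[ledger["Debit"] > 0]
print(debits)
```

A useful rule of thumb: `.loc` takes the labels you see printed, while `.iloc` takes integer positions starting at 0.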
Create a DataFrame
There are many ways to create a DataFrame, but one of the most common methods is to use a dictionary of lists or arrays, where each key represents a column name and the corresponding value is a list or array of data for that column. Pass the dictionary to the pd.DataFrame() constructor:
# Sample accounting data
data = {
"Date": ["2025-09-01", "2025-09-01", "2025-09-02", "2025-09-03"],
"Account": ["Cash", "Revenue", "Rent Expense", "Cash"],
"Debit": [1000, 0, 500, 0],
"Credit": [0, 1000, 0, 500],
}
# Create DataFrame
df = pd.DataFrame(data)
# Display DataFrame
display(df)

print(df)

         Date       Account  Debit  Credit
0  2025-09-01          Cash   1000       0
1  2025-09-01       Revenue      0    1000
2  2025-09-02  Rent Expense    500       0
3  2025-09-03          Cash      0     500

Concise summary of a DataFrame
A common way to get a quick overview of a DataFrame is to use the .info() method. This method provides a concise summary of the DataFrame, including the number of non-null entries, data types of each column, and memory usage.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Date     4 non-null      object
 1   Account  4 non-null      object
 2   Debit    4 non-null      int64
 3   Credit   4 non-null      int64
dtypes: int64(2), object(2)
memory usage: 260.0+ bytes
👉 From the result of df.info(), we can learn several things:
- There are 4 columns.
- 2 out of 4 columns have the object data type.
- The second line of the output tells us the number of rows (“4 entries”).
- No columns contain missing values, since every column reports 4 non-null values for 4 entries.
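The observations above can also be checked in code. The sketch below rebuilds the same sample data, counts missing values per column, and shows one possible follow-up: since the Date column holds plain strings, it can be parsed with `pd.to_datetime()`. Whether that conversion is wanted depends on the analysis, so treat it as an optional step.

```python
import pandas as pd

# Rebuild the sample accounting data from above
df = pd.DataFrame(
    {
        "Date": ["2025-09-01", "2025-09-01", "2025-09-02", "2025-09-03"],
        "Account": ["Cash", "Revenue", "Rent Expense", "Cash"],
        "Debit": [1000, 0, 500, 0],
        "Credit": [0, 1000, 0, 500],
    }
)

# Count missing values per column (all zeros for this data)
missing_per_column = df.isna().sum()
print(missing_per_column)

# Optional: parse the date strings into datetime values
df["Date"] = pd.to_datetime(df["Date"])
print(df.dtypes)
```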
Creating a DataFrame from a CSV file
You can also create a DataFrame by reading data from a CSV file using the pd.read_csv() function. This is a common way to load data into a Pandas DataFrame for analysis. It works with local files as well as files hosted at a URL. In the code below, we read a CSV file from a URL and create a DataFrame from it.
This is more practical than creating a DataFrame from scratch, as real-world data is often stored in files.
df_products = pd.read_csv(
"https://raw.githubusercontent.com/bdi475/datasets/main/maven-toys-data/products.csv"
)
display(df_products)

df_products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Product_ID        35 non-null     int64
 1   Product_Name      35 non-null     object
 2   Product_Category  35 non-null     object
 3   Product_Cost      35 non-null     float64
 4   Product_Price     35 non-null     float64
dtypes: float64(2), int64(1), object(2)
memory usage: 1.5+ KB
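If you want to experiment with CSV loading without network access, one option is to read from an in-memory string. This is a minimal sketch: `csv_text`, `df_small`, and `round_trip` are names made up for the example, and the two rows copy a few fields from the products data.

```python
import io

import pandas as pd

# A small CSV held in memory instead of a file or URL
csv_text = """Product_ID,Product_Name,Product_Price
1,Action Figure,15.99
2,Animal Figures,12.99
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO
df_small = pd.read_csv(io.StringIO(csv_text))
print(df_small)

# to_csv with no path returns the CSV text as a string
round_trip = df_small.to_csv(index=False)
print(round_trip)
```

Passing `index=False` keeps the row labels out of the exported text, which is usually what you want when the index is just the default 0, 1, 2, and so on.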
Using DataFrame methods
There are many methods available for DataFrame objects, such as head(), tail(), describe(), info(), and more. You can explore the Pandas documentation for a comprehensive list of available methods.
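Of the methods listed above, describe() is worth a quick look because it summarizes every numeric column at once. The sketch below uses a small made-up table; `df_demo` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical price and stock table (values invented for illustration)
df_demo = pd.DataFrame(
    {"price": [3.99, 9.99, 15.99, 19.99], "stock": [10, 0, 5, 25]}
)

# describe() reports count, mean, std, min, quartiles, and max per numeric column
summary = df_demo.describe()
print(summary)

# The result is itself a DataFrame, so single statistics can be looked up by label
print(summary.loc["mean", "stock"])  # 10.0
```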
Display first few rows of a DataFrame
# first five rows (the default)
df_products.head()
Display last few rows of a DataFrame
# last five rows (the default)
df_products.tail()
# last two rows
df_products.tail(2)
Sample random rows from a DataFrame
# randomly sample one row from the DataFrame
df_products.sample()
# randomly sample three rows from the DataFrame
df_products.sample(3)
DataFrame properties
You can also access various properties of a DataFrame object, such as dtypes, index, columns, values, and shape. These properties provide the data type of each column, the row index labels, the column labels, the underlying data as a NumPy array, and the shape of the DataFrame, respectively.
df_products.dtypes

Product_ID            int64
Product_Name         object
Product_Category     object
Product_Cost        float64
Product_Price       float64
dtype: object

df_products.index

RangeIndex(start=0, stop=35, step=1)

df_products.values

array([[1, 'Action Figure', 'Toys', 9.99, 15.99],
[2, 'Animal Figures', 'Toys', 9.99, 12.99],
[3, "Barrel O' Slime", 'Art & Crafts', 1.99, 3.99],
[4, 'Chutes & Ladders', 'Games', 9.99, 12.99],
[5, 'Classic Dominoes', 'Games', 7.99, 9.99],
[6, 'Colorbuds', 'Electronics', 6.99, 14.99],
[7, 'Dart Gun', 'Sports & Outdoors', 11.99, 15.99],
[8, 'Deck Of Cards', 'Games', 3.99, 6.99],
[9, 'Dino Egg', 'Toys', 9.99, 10.99],
[10, 'Dinosaur Figures', 'Toys', 10.99, 14.99],
[11, 'Etch A Sketch', 'Art & Crafts', 10.99, 20.99],
[12, 'Foam Disk Launcher', 'Sports & Outdoors', 8.99, 11.99],
[13, 'Gamer Headphones', 'Electronics', 14.99, 20.99],
[14, 'Glass Marbles', 'Games', 5.99, 10.99],
[15, 'Hot Wheels 5-Pack', 'Toys', 3.99, 5.99],
[16, 'Jenga', 'Games', 2.99, 9.99],
[17, 'Kids Makeup Kit', 'Art & Crafts', 13.99, 19.99],
[18, 'Lego Bricks', 'Toys', 34.99, 39.99],
[19, 'Magic Sand', 'Art & Crafts', 13.99, 15.99],
[20, 'Mini Basketball Hoop', 'Sports & Outdoors', 8.99, 24.99],
[21, 'Mini Ping Pong Set', 'Sports & Outdoors', 6.99, 9.99],
[22, 'Monopoly', 'Games', 13.99, 19.99],
[23, 'Mr. Potatohead', 'Toys', 4.99, 9.99],
[24, 'Nerf Gun', 'Sports & Outdoors', 14.99, 19.99],
[25, 'PlayDoh Can', 'Art & Crafts', 1.99, 2.99],
[26, 'PlayDoh Playset', 'Art & Crafts', 20.99, 24.99],
[27, 'PlayDoh Toolkit', 'Art & Crafts', 3.99, 4.99],
[28, 'Playfoam', 'Art & Crafts', 3.99, 10.99],
[29, 'Plush Pony', 'Toys', 8.99, 19.99],
[30, "Rubik's Cube", 'Games', 17.99, 19.99],
[31, 'Splash Balls', 'Sports & Outdoors', 7.99, 8.99],
[32, 'Supersoaker Water Gun', 'Sports & Outdoors', 11.99, 14.99],
[33, 'Teddy Bear', 'Toys', 10.99, 12.99],
[34, 'Toy Robot', 'Electronics', 20.99, 25.99],
[35, 'Uno Card Game', 'Games', 3.99, 7.99]], dtype=object)

df_products.shape

(35, 5)
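The columns property is mentioned above but not shown in the outputs. Here is a small sketch on a stand-in table; `df_demo` is a hypothetical two-row frame that borrows a few fields from the products data. It also uses `to_numpy()`, a method that returns the same kind of array as `.values`.

```python
import pandas as pd

# Small stand-in for df_products, using two rows from the output above
df_demo = pd.DataFrame(
    {
        "Product_ID": [1, 2],
        "Product_Name": ["Action Figure", "Animal Figures"],
        "Product_Price": [15.99, 12.99],
    }
)

# Column labels as a plain Python list
print(df_demo.columns.tolist())  # ['Product_ID', 'Product_Name', 'Product_Price']

# (rows, columns)
print(df_demo.shape)             # (2, 3)

# Underlying data as a NumPy array
arr = df_demo.to_numpy()
print(arr.shape)                 # (2, 3)
```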