While Pandas was initially developed for financial data analysis, it now provides robust support for string manipulation.
These string methods allow for operations such as searching, replacing, splitting, and pattern matching using regular expressions.
The string methods in Pandas are accessed through the .str accessor, which can be applied to Series and DataFrame columns containing string data. The operations are vectorized, meaning they can be applied to entire columns efficiently.
▶️ Import pandas and numpy.
import pandas as pd
import numpy as np🧹 Clean Data using String Methods¶
We will use a sample dataset containing course information to demonstrate various string methods in Pandas.
▶️ Create a new DataFrame named df_courses.
data = {
"Course Code": [" fin 221 ", "ACCY 301", " BAdm 350", "FIN 445 "],
"Course Title": [
"Corporate Finance",
"Atg Measurement & Disclosure",
"IT for Networked Orgs",
"Real Estate Investment",
],
"Semesters": ["Fall 2025", "Fall_2025", "Spring/2026", "Spring_2026"],
}
df_courses = pd.DataFrame(data)
display(df_courses)The Course_Code and Semesters columns contain inconsistent formatting, such as leading/trailing spaces and mixed case letters. We will use string methods to clean this data.
🔠 Change Case¶
It’s common to standardize text by converting it to a consistent case.
.str.lower(): Converts all characters in the string to lowercase..str.upper(): Converts all characters in the string to uppercase..str.title(): Converts the first character of each word to uppercase and the rest to lowercase..str.capitalize(): Converts the first character of the string to uppercase and the rest to lowercase.
▶️ Use the .str.upper() method to convert all text in the Course_Code column to uppercase letters.
df_courses["Course Code"] = df_courses["Course Code"].str.upper()
display(df_courses)↔️ Remove whitespace¶
Text data often has extra whitespace that needs to be removed.
.str.strip(): Removes leading and trailing whitespace..str.lstrip(): Removes leading whitespace..str.rstrip(): Removes trailing whitespace.
▶️ The "Course_Code" column has leading and trailing spaces. Use .values to print the raw values of the column.
print("Before removing whitespaces:")
df_courses["Course Code"].valuesBefore removing whitespaces:
array([' FIN 221 ', 'ACCY 301', ' BADM 350', 'FIN 445 '],
dtype=object)▶️ Use the .str.strip() method to remove leading and trailing whitespace from the Course_Code column.
df_courses["Course Code"] = df_courses["Course Code"].str.strip()▶️ Check the cleaned values in the Course_Code column using .values.
print("After removing whitespaces:")
df_courses["Course Code"].valuesAfter removing whitespaces:
array(['FIN 221', 'ACCY 301', 'BADM 350', 'FIN 445'], dtype=object)🔁 Replace Substrings¶
The "Semesters" column uses inconsistent delimiters (/ and _). We will standardize this by replacing these characters with a space.
df_courses["Semesters"] = (
df_courses["Semesters"].str.replace("/", " ").str.replace("_", " ")
)
display(df_courses)🔍 Check for Substrings¶
We can check if a substring exists within each string in a column using the .str.contains() method. This method returns a boolean Series indicating whether each string contains the specified substring.
▶️ Use the .str.contains() method to indicate whether a course is a Finance course (i.e., if the "Course Code" contains the substring "FIN"). Store the result in a new column named "Is Finance".
df_courses["Is Finance"] = df_courses["Course Code"].str.contains("FIN")
display(df_courses)📚 List of .str Methods¶
Here are some commonly used string methods available through the .str accessor in Pandas. You will not need to memorize all of these, but it’s useful to be familiar with them and know where to find this reference.
| Method | Description | Example Usage |
|---|---|---|
.str.lower() | Converts all characters in the string to lowercase | df["col"].str.lower() |
.str.upper() | Converts all characters in the string to uppercase | df["col"].str.upper() |
.str.title() | Converts the first character of each word to uppercase and the rest | df["col"].str.title() |
.str.capitalize() | Converts the first character of the string to uppercase and the rest to lowercase | df["col"].str.capitalize() |
.str.strip() | Removes leading and trailing whitespace | df["col"].str.strip() |
.str.lstrip() | Removes leading whitespace | df["col"].str.lstrip() |
.str.rstrip() | Removes trailing whitespace | df["col"].str.rstrip() |
.str.replace(old, new) | Replaces occurrences of a substring with another substring | df["col"].str.replace("old", "new") |
.str.contains(substring) | Checks if the string contains a specified substring. Returns a boolean Series | df["col"].str.contains("pattern") |
.str.startswith(prefix) | Checks if the string starts with a specified prefix. Returns a boolean Series | df["col"].str.startswith("prefix") |
.str.endswith(suffix) | Checks if the string ends with a specified suffix. Returns a boolean Series | df["col"].str.endswith("suffix") |
.str.len() | Returns the length of each string in the Series | df["col"].str.len() |
.str.split(delimiter) | Splits each string by the specified delimiter and returns a list of substrings | df["col"].str.split(",") |
.str.join(delimiter) | Joins lists of strings in each element of the Series into a single string, separated by the specified delimiter | df["col"].str.join(",") |
.str.extract(pattern) | Extracts groups from the first match of a regular expression pattern | df["col"].str.extract(r"(\\d+)") |
.str.pad(width, side='left', fillchar=' ') | Pads strings to a specified width with a specified character | df["col"].str.pad(10, side='left', fillchar='0') |
.str.zfill(width) | Pads strings on the left with zeros to a specified width | df["col"].str.zfill(5) |
🧵 Using Regular Expression to Extract Information¶
A regular expression (regex or regexp for short) is a sequence of characters that forms a search pattern. It can be used for string matching, extraction, and manipulation.
Regexs are powerful tools for pattern matching and text processing. They allow you to define complex search patterns using a combination of literal characters and special symbols. Although it may seem complex at first, mastering regex can greatly enhance your ability to work with text data.
Pandas string methods support regex for advanced string operations. For example, you can use regex patterns with methods like .str.contains(), .str.replace(), and .str.extract() to perform sophisticated text manipulations.
In this example, we will extract course information from a single string column.
▶️ Create a new DataFrame named df_info.
df_info = pd.DataFrame(
{
"Course Title": [
"CS 124: Introduction to Computer Science",
"FIN221 - Corporate Finance",
"PSYC 100-Introduction to Psychology",
"MATH285 Intro to Differential Equations",
]
}
)
display(df_info)We need to define a pattern with “capture groups” (sections enclosed in parentheses ()) for each piece of information we want to extract.
Here’s the pattern we’ll use: ^([A-Z]{2,4})\s?(\d{3})\s?[:-]?\s?(.*)$
Let’s break that down:
^: Asserts the start of the string.([A-Z]{2,4}): (Capture Group 1) Matches and captures 2 to 4 uppercase letters (the department code, like “CS” or “MATH”).\s?: Matches an optional single whitespace character.(\d{3}): (Capture Group 2) Matches and captures exactly 3 digits (the course number, like “124” or “221”).\s?[:-]?\s?: Matches an optional separator, which could be a space, a colon :, or a hyphen -.(.*): (Capture Group 3) Matches and captures all remaining characters in the string (the course title).$: Asserts the end of the string.
# Define the regex pattern with capture groups
pattern = r"^([A-Z]{2,4})\s?(\d{3})\s?[:-]?\s?(.*)$"
# Apply the .extract() method
df_info[["Dept", "Number", "Title"]] = df_info["Course Title"].str.extract(pattern)
display(df_info)