Skip to article frontmatterSkip to article content

Pandas String Methods

While Pandas was initially developed for financial data analysis, it now provides robust support for string manipulation.

These string methods allow for operations such as searching, replacing, splitting, and pattern matching using regular expressions.

The string methods in Pandas are accessed through the .str accessor, which can be applied to Series and DataFrame columns containing string data. The operations are vectorized, meaning they can be applied to entire columns efficiently.

▶️ Import pandas and numpy.

import pandas as pd
import numpy as np

🧹 Clean Data using String Methods

We will use a sample dataset containing course information to demonstrate various string methods in Pandas.

▶️ Create a new DataFrame named df_courses.

data = {
    "Course Code": ["      fin 221  ", "ACCY 301", "      BAdm 350", "FIN 445      "],
    "Course Title": [
        "Corporate Finance",
        "Atg Measurement & Disclosure",
        "IT for Networked Orgs",
        "Real Estate Investment",
    ],
    "Semesters": ["Fall 2025", "Fall_2025", "Spring/2026", "Spring_2026"],
}
df_courses = pd.DataFrame(data)

display(df_courses)
Loading...

The Course_Code and Semesters columns contain inconsistent formatting, such as leading/trailing spaces and mixed case letters. We will use string methods to clean this data.

🔠 Change Case

It’s common to standardize text by converting it to a consistent case.

  • .str.lower(): Converts all characters in the string to lowercase.
  • .str.upper(): Converts all characters in the string to uppercase.
  • .str.title(): Converts the first character of each word to uppercase and the rest to lowercase.
  • .str.capitalize(): Converts the first character of the string to uppercase and the rest to lowercase.

▶️ Use the .str.upper() method to convert all text in the Course_Code column to uppercase letters.

df_courses["Course Code"] = df_courses["Course Code"].str.upper()

display(df_courses)
Loading...

↔️ Remove whitespace

Text data often has extra whitespace that needs to be removed.

  • .str.strip(): Removes leading and trailing whitespace.
  • .str.lstrip(): Removes leading whitespace.
  • .str.rstrip(): Removes trailing whitespace.

▶️ The "Course_Code" column has leading and trailing spaces. Use .values to print the raw values of the column.

print("Before removing whitespaces:")
df_courses["Course Code"].values
Before removing whitespaces:
array([' FIN 221 ', 'ACCY 301', ' BADM 350', 'FIN 445 '], dtype=object)

▶️ Use the .str.strip() method to remove leading and trailing whitespace from the Course_Code column.

df_courses["Course Code"] = df_courses["Course Code"].str.strip()

▶️ Check the cleaned values in the Course_Code column using .values.

print("After removing whitespaces:")
df_courses["Course Code"].values
After removing whitespaces:
array(['FIN 221', 'ACCY 301', 'BADM 350', 'FIN 445'], dtype=object)

🔁 Replace Substrings

The "Semesters" column uses inconsistent delimiters (/ and _). We will standardize this by replacing these characters with a space.

df_courses["Semesters"] = (
    df_courses["Semesters"].str.replace("/", " ").str.replace("_", " ")
)

display(df_courses)
Loading...

🔍 Check for Substrings

We can check if a substring exists within each string in a column using the .str.contains() method. This method returns a boolean Series indicating whether each string contains the specified substring.

▶️ Use the .str.contains() method to indicate whether a course is a Finance course (i.e., if the "Course Code" contains the substring "FIN"). Store the result in a new column named "Is Finance".

df_courses["Is Finance"] = df_courses["Course Code"].str.contains("FIN")

display(df_courses)
Loading...

📚 List of .str Methods

Here are some commonly used string methods available through the .str accessor in Pandas. You will not need to memorize all of these, but it’s useful to be familiar with them and know where to find this reference.

MethodDescriptionExample Usage
.str.lower()Converts all characters in the string to lowercasedf["col"].str.lower()
.str.upper()Converts all characters in the string to uppercasedf["col"].str.upper()
.str.title()Converts the first character of each word to uppercase and the restdf["col"].str.title()
.str.capitalize()Converts the first character of the string to uppercase and the rest to lowercasedf["col"].str.capitalize()
.str.strip()Removes leading and trailing whitespacedf["col"].str.strip()
.str.lstrip()Removes leading whitespacedf["col"].str.lstrip()
.str.rstrip()Removes trailing whitespacedf["col"].str.rstrip()
.str.replace(old, new)Replaces occurrences of a substring with another substringdf["col"].str.replace("old", "new")
.str.contains(substring)Checks if the string contains a specified substring. Returns a boolean Seriesdf["col"].str.contains("pattern")
.str.startswith(prefix)Checks if the string starts with a specified prefix. Returns a boolean Seriesdf["col"].str.startswith("prefix")
.str.endswith(suffix)Checks if the string ends with a specified suffix. Returns a boolean Seriesdf["col"].str.endswith("suffix")
.str.len()Returns the length of each string in the Seriesdf["col"].str.len()
.str.split(delimiter)Splits each string by the specified delimiter and returns a list of substringsdf["col"].str.split(",")
.str.join(delimiter)Joins lists of strings in each element of the Series into a single string, separated by the specified delimiterdf["col"].str.join(",")
.str.extract(pattern)Extracts groups from the first match of a regular expression patterndf["col"].str.extract(r"(\\d+)")
.str.pad(width, side='left', fillchar=' ')Pads strings to a specified width with a specified characterdf["col"].str.pad(10, side='left', fillchar='0')
.str.zfill(width)Pads strings on the left with zeros to a specified widthdf["col"].str.zfill(5)

🧵 Using Regular Expression to Extract Information

A regular expression (regex or regexp for short) is a sequence of characters that forms a search pattern. It can be used for string matching, extraction, and manipulation.

Regexs are powerful tools for pattern matching and text processing. They allow you to define complex search patterns using a combination of literal characters and special symbols. Although it may seem complex at first, mastering regex can greatly enhance your ability to work with text data.

Pandas string methods support regex for advanced string operations. For example, you can use regex patterns with methods like .str.contains(), .str.replace(), and .str.extract() to perform sophisticated text manipulations.

In this example, we will extract course information from a single string column.

▶️ Create a new DataFrame named df_info.

df_info = pd.DataFrame(
    {
        "Course Title": [
            "CS 124: Introduction to Computer Science",
            "FIN221 - Corporate Finance",
            "PSYC 100-Introduction to Psychology",
            "MATH285 Intro to Differential Equations",
        ]
    }
)

display(df_info)
Loading...

We need to define a pattern with “capture groups” (sections enclosed in parentheses ()) for each piece of information we want to extract.

Here’s the pattern we’ll use: ^([A-Z]{2,4})\s?(\d{3})\s?[:-]?\s?(.*)$

Let’s break that down:

  • ^: Asserts the start of the string.
  • ([A-Z]{2,4}): (Capture Group 1) Matches and captures 2 to 4 uppercase letters (the department code, like “CS” or “MATH”).
  • \s?: Matches an optional single whitespace character.
  • (\d{3}): (Capture Group 2) Matches and captures exactly 3 digits (the course number, like “124” or “221”).
  • \s?[:-]?\s?: Matches an optional separator, which could be a space, a colon :, or a hyphen -.
  • (.*): (Capture Group 3) Matches and captures all remaining characters in the string (the course title).
  • $: Asserts the end of the string.
# Define the regex pattern with capture groups
pattern = r"^([A-Z]{2,4})\s?(\d{3})\s?[:-]?\s?(.*)$"

# Apply the .extract() method
df_info[["Dept", "Number", "Title"]] = df_info["Course Title"].str.extract(pattern)

display(df_info)
Loading...