
Natural Language Processing (NLP) is a field of computer science that focuses on the interaction between computers and humans through natural language. The ultimate objective of NLP is to enable computers to understand, interpret, and generate human language in a way that is valuable.

Although NLP covers a wide range of techniques and applications, some of the most common preprocessing tasks include:

  1. Tokenization: Breaking down text into smaller units, such as words or sentences.

  2. Lowercasing: Converting all characters in the text to lowercase to ensure uniformity.

  3. Lemmatization: Reducing words to their base or root form:

    • Example: “running” becomes “run”

    • Example: “tasks” becomes “task”

  4. Special Character Removal: Stripping out punctuation, numbers, and other non-alphabetic characters from the text.

  5. Stopword Removal: Eliminating common words (e.g., “the”, “is”, “and”) that do not contribute significantly to the meaning of the text.

Import Packages

spaCy is an NLP library in Python that provides tools for tokenization, lemmatization, and more. You may have used nltk or textblob before, but spaCy is known for its speed and efficiency. For small tasks like this, you will not notice much of a difference, but for larger datasets, spaCy can be significantly faster.

import pandas as pd
import numpy as np
import plotly.express as px
import spacy
from collections import Counter

NLP with spaCy using a String

# Load spaCy English model
# Make sure you've run: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

Here is a sample text that we will process using spaCy:

original_text = "Celebrating 10 years with Mastercard has been an incredible journey - great benefits, flexible hours, and amazing colleagues!"

An nlp object is created using the spacy.load() function, which loads a pre-trained language model. In this case, we are using the English model en_core_web_sm. The text is then processed by calling the nlp object on it, which creates a Doc object containing tokens and their linguistic features.

# Create a spaCy Doc
doc = nlp(original_text)

# Check the type of doc
type(doc)
spacy.tokens.doc.Doc

We can tokenize the text, convert it to lowercase, lemmatize the tokens, remove special characters, and eliminate stopwords. Although this tutorial intentionally breaks down each step for clarity, in practice, these steps can be combined into a single processing pipeline for efficiency.
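As a preview, these steps can be wrapped into a single helper function. Here is a minimal sketch (the name preprocess is just for illustration; each step is explained individually in the sections that follow):

# One-pass preprocessing: tokenize, lemmatize, lowercase, drop stopwords/punctuation
def preprocess(text):
    doc = nlp(text)
    return [
        token.lemma_.lower()
        for token in doc
        if not token.is_stop and not token.is_punct and not token.is_space
    ]

preprocess(original_text)[:5]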

Tokenization

Tokenization is the process of breaking down text into smaller units called tokens, which can be words, phrases, or symbols. In this case, we are tokenizing the text using spaCy’s Doc object, which allows us to easily access and manipulate the tokens.

# Tokenization
tokens = [token.text for token in doc]

tokens[:5]
['Celebrating', '10', 'years', 'with', 'Mastercard']
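spaCy can also split text into sentences through the doc.sents iterator (with en_core_web_sm, sentence boundaries come from the dependency parser). Here is a minimal sketch on a made-up two-sentence string:

# Sentence segmentation on a short, made-up example
two_sentence_doc = nlp("I joined Mastercard in 2014. The benefits are great.")
[sent.text for sent in two_sentence_doc.sents]

Each element of doc.sents is a Span covering one sentence.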

Lowercasing

We convert all tokens to lowercase to ensure uniformity. This helps in reducing the number of unique tokens, as “The” and “the” will be treated as the same token. Note that we are using Python’s built-in lower() method for strings.

# Lowercasing
lower_tokens = [t.lower() for t in tokens]

lower_tokens[:5]
['celebrating', '10', 'years', 'with', 'mastercard']

Lemmatization

Lemmatization is the process of reducing words to their base or root form, known as the lemma. This helps in normalizing words and reducing the number of unique tokens. For example, “running” becomes “run”, and “tasks” becomes “task”. In this case, we are using spaCy’s built-in lemmatization capabilities to obtain the lemmas of the tokens.

The lemma_ attribute of each token in the Doc object provides the lemmatized form of the token.

# Lemmatization
lemmas = [token.lemma_ for token in doc]

lemmas[:5]
['celebrate', '10', 'year', 'with', 'Mastercard']

Stopword Removal

Stopwords are common words that do not contribute significantly to the meaning of the text. Examples include “the”, “is”, “and”, etc. Removing stopwords helps in reducing noise and focusing on the more meaningful words in the text.

spaCy provides a built-in attribute is_stop for each token, which indicates whether the token is a stopword. We can use this attribute to filter out stopwords from our list of tokens.

# Stopword & punctuation removal (lemmatized + lowercased)
clean_tokens = [
    token.lemma_.lower()
    for token in doc
    if not token.is_stop and not token.is_punct and not token.is_space
]
clean_tokens[:5]
['celebrate', '10', 'year', 'mastercard', 'incredible']
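For reference, the stopword list used by is_stop ships with the language model itself; a quick peek (the exact count depends on your spaCy version):

# spaCy's built-in English stopword list (used by token.is_stop)
print(len(nlp.Defaults.stop_words))
sorted(nlp.Defaults.stop_words)[:10]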
print("Original:", original_text)
print("Tokens:", tokens)
print("Lower tokens:", lower_tokens)
print("Lemmas:", lemmas)
print("Clean tokens (no stopwords/punct, lemmatized, lowercased):", clean_tokens)
Original: Celebrating 10 years with Mastercard has been an incredible journey - great benefits, flexible hours, and amazing colleagues!
Tokens: ['Celebrating', '10', 'years', 'with', 'Mastercard', 'has', 'been', 'an', 'incredible', 'journey', '-', 'great', 'benefits', ',', 'flexible', 'hours', ',', 'and', 'amazing', 'colleagues', '!']
Lower tokens: ['celebrating', '10', 'years', 'with', 'mastercard', 'has', 'been', 'an', 'incredible', 'journey', '-', 'great', 'benefits', ',', 'flexible', 'hours', ',', 'and', 'amazing', 'colleagues', '!']
Lemmas: ['celebrate', '10', 'year', 'with', 'Mastercard', 'have', 'be', 'an', 'incredible', 'journey', '-', 'great', 'benefit', ',', 'flexible', 'hour', ',', 'and', 'amazing', 'colleague', '!']
Clean tokens (no stopwords/punct, lemmatized, lowercased): ['celebrate', '10', 'year', 'mastercard', 'incredible', 'journey', 'great', 'benefit', 'flexible', 'hour', 'amazing', 'colleague']
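Note that clean_tokens still contains the number “10”. If you also want the special character and number removal described in step 4 above, spaCy’s is_alpha attribute keeps only purely alphabetic tokens; a minimal sketch:

# Keep only alphabetic, non-stopword tokens (also drops numbers and symbols)
alpha_tokens = [
    token.lemma_.lower()
    for token in doc
    if token.is_alpha and not token.is_stop
]
alpha_tokens

This should match clean_tokens above, minus the “10”.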

NLP with spaCy using a DataFrame

We can apply the same NLP techniques to a pandas DataFrame containing multiple reviews.

Dataset

The dataset contains Glassdoor employee reviews for MasterCard. Each review has a unique review_id and multiple ratings and text fields. We will focus on the text fields, which contain text about what employees liked or disliked about working at MasterCard.

Some reviews may contain special characters, mixed casing, and stopwords, which we will clean using the NLP techniques mentioned above.

df = pd.read_csv(
    "https://raw.githubusercontent.com/bdi475/datasets/refs/heads/main/mastercard-glassdoor-reviews.csv"
)
df.head(3)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 22 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   review_id                         1000 non-null   int64  
 1   review_date                       1000 non-null   object 
 2   review_text                       389 non-null    object 
 3   review_liked_text                 1000 non-null   object 
 4   review_disliked_text              1000 non-null   object 
 5   count_helpful                     1000 non-null   int64  
 6   count_not_helpful                 1000 non-null   int64  
 7   employer_responses                38 non-null     object 
 8   is_current_job                    1000 non-null   bool   
 9   length_of_employment              1000 non-null   int64  
 10  rating_business_outlook           661 non-null    object 
 11  rating_career_opportunities       1000 non-null   float64
 12  rating_ceo                        685 non-null    object 
 13  rating_compensation_and_benefits  1000 non-null   float64
 14  rating_culture_and_values         1000 non-null   int64  
 15  rating_diversity_and_inclusion    1000 non-null   int64  
 16  rating_overall                    1000 non-null   int64  
 17  rating_recommend_to_friend        714 non-null    object 
 18  rating_senior_leadership          1000 non-null   float64
 19  rating_work_life_balance          1000 non-null   float64
 20  job_title_text                    871 non-null    object 
 21  location_name                     803 non-null    object 
dtypes: bool(1), float64(4), int64(7), object(10)
memory usage: 165.2+ KB

There are 1000 reviews in total.

df.shape
(1000, 22)

What did employees like about working at MasterCard?

token_lists = []

# Disable the dependency parser ("parser") and named entity recognizer ("ner")
# since we only need tokenization and lemmatization; this speeds up nlp.pipe()
for doc in nlp.pipe(df["review_liked_text"], disable=["parser", "ner"]):
    token_lists.append(
        [
            token.lemma_.lower()
            for token in doc
            if not token.is_stop and not token.is_punct and not token.is_space
        ]
    )

df["review_liked_tokens"] = token_lists

df[["review_id", "review_liked_text", "review_liked_tokens"]].head()
# Explode liked_tokens so each token becomes its own row
df_liked_exploded = df[["review_id", "review_liked_tokens"]].explode("review_liked_tokens").reset_index(drop=True)
df_liked_exploded = df_liked_exploded.rename(columns={"review_liked_tokens": "token"})

# Inspect result
df_liked_exploded.head()

After processing the “liked” column, we explode the list of tokens so that each token becomes its own row in the DataFrame. We can also clean the exploded tokens by dropping any missing or empty tokens and trimming whitespace, as sketched after the shape check below.

df_liked_exploded.shape
(12919, 2)
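Here is a minimal sketch of that cleanup, using the column names from above (reviews with no tokens produce missing values after explode):

# Drop missing tokens, trim whitespace, and remove any empty strings
df_liked_exploded = df_liked_exploded.dropna(subset=["token"])
df_liked_exploded["token"] = df_liked_exploded["token"].str.strip()
df_liked_exploded = df_liked_exploded[df_liked_exploded["token"] != ""].reset_index(drop=True)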

The most common tokens in the “liked” reviews can be identified by counting the occurrences of each token in the exploded DataFrame.

df_liked_common_tokens = df_liked_exploded["token"].value_counts().to_frame().reset_index().head(30)
df_liked_common_tokens
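Since plotly.express is already imported, the counts can also be visualized. Here is a minimal sketch of a bar chart; the token and count column names assume pandas 2.x, where value_counts() names the count column count:

# Bar chart of the 30 most common tokens in the "liked" reviews
fig = px.bar(
    df_liked_common_tokens,
    x="token",
    y="count",
    title="Most common tokens in what employees liked",
)
fig.show()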

What did employees dislike about working at MasterCard?

We can repeat the same process for the disliked column to identify the most common tokens in that column as well.

token_lists = []

# Disable the dependency parser ("parser") and named entity recognizer ("ner")
# since we only need tokenization and lemmatization; this speeds up nlp.pipe()
for doc in nlp.pipe(df["review_disliked_text"], disable=["parser", "ner"]):
    token_lists.append(
        [
            token.lemma_.lower()
            for token in doc
            if not token.is_stop and not token.is_punct and not token.is_space
        ]
    )

df["review_disliked_tokens"] = token_lists

df[["review_id", "review_disliked_text", "review_disliked_tokens"]].head()
# Explode disliked_tokens so each token becomes its own row
df_disliked_exploded = df[["review_id", "review_disliked_tokens"]].explode("review_disliked_tokens").reset_index(drop=True)
df_disliked_exploded = df_disliked_exploded.rename(columns={"review_disliked_tokens": "token"})

# Inspect result
df_disliked_exploded.head()
df_disliked_common_tokens = df_disliked_exploded["token"].value_counts().to_frame().reset_index().head(30)
df_disliked_common_tokens