Natural Language Processing (NLP) is a field of computer science that focuses on the interaction between computers and humans through natural language. The ultimate objective of NLP is to enable computers to understand, interpret, and generate human language in a way that is valuable.
Although NLP covers a wide range of techniques and applications, some of the most common preprocessing tasks are listed below, followed by a rough plain-Python sketch of what they look like:
Tokenization: Breaking down text into smaller units, such as words or sentences.
Lowercasing: Converting all characters in the text to lowercase to ensure uniformity.
Lemmatization: Reducing words to their base or root form:
Example: “running” becomes “run”
Example: “tasks” becomes “task”
Special Character Removal: Stripping out punctuation, numbers, and other non-alphabetic characters from the text.
Stopword Removal: Eliminating common words (e.g., “the”, “is”, “and”) that do not contribute significantly to the meaning of the text.
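As a rough illustration of the list above, here is a minimal plain-Python sketch of lowercasing, special character removal, tokenization, and stopword removal. The tiny stopword list and the regex are illustrative assumptions, and lemmatization is left out because it requires a model or dictionary, which is exactly what spaCy provides below.
import re

# A tiny, illustrative stopword list -- real NLP libraries ship much larger ones
STOPWORDS = {"the", "is", "and", "a", "an", "with", "has", "been"}

def naive_preprocess(text):
    # Lowercase, replace non-alphabetic characters with spaces,
    # split on whitespace, and drop stopwords (no lemmatization here)
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return [t for t in text.split() if t not in STOPWORDS]

naive_preprocess("Celebrating 10 years with Mastercard has been an incredible journey!")
# ['celebrating', 'years', 'mastercard', 'incredible', 'journey']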
Import Packages¶
spaCy is an NLP library in Python that provides tools for tokenization, lemmatization, and more. You may have used nltk or textblob before, but spaCy is known for its speed and efficiency. For small tasks like this, you will not notice much of a difference, but for larger datasets, spaCy can be significantly faster.
import pandas as pd
import numpy as np
import plotly.express as px
import spacy
from collections import Counter
NLP with spaCy using a String¶
# Load spaCy English model
# Make sure you've run: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
Here is a sample text that we will process using spaCy:
original_text = "Celebrating 10 years with Mastercard has been an incredible journey - great benefits, flexible hours, and amazing colleagues!"
An nlp object is created using the spacy.load() function, which loads a pre-trained language model. In this case, we are using the English model en_core_web_sm. The text is then processed by calling the nlp object on it, which creates a Doc object containing the tokens and their linguistic features.
# Create a spaCy Doc
doc = nlp(original_text)
# Check the type of doc
type(doc)
spacy.tokens.doc.Doc
We can tokenize the text, convert it to lowercase, lemmatize the tokens, remove special characters, and eliminate stopwords. Although this tutorial intentionally breaks down each step for clarity, in practice, these steps can be combined into a single processing pipeline for efficiency.
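As a preview, the individual steps that follow can be collapsed into a single helper along these lines (a sketch; the preprocess name is ours, not part of spaCy's API):
def preprocess(text):
    # Run the text through the spaCy pipeline, then keep the lowercased lemma
    # of every token that is not a stopword, punctuation, or whitespace
    return [
        token.lemma_.lower()
        for token in nlp(text)
        if not token.is_stop and not token.is_punct and not token.is_space
    ]

preprocess(original_text)[:5]
# ['celebrate', '10', 'year', 'mastercard', 'incredible']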
Tokenization¶
Tokenization is the process of breaking down text into smaller units called tokens, which can be words, phrases, or symbols. In this case, we are tokenizing the text using spaCy’s Doc object, which allows us to easily access and manipulate the tokens.
# Tokenization
tokens = [token.text for token in doc]
tokens[:5]
['Celebrating', '10', 'years', 'with', 'Mastercard']
Lowercasing¶
We convert all tokens to lowercase to ensure uniformity. This helps in reducing the number of unique tokens, as “The” and “the” will be treated as the same token. Note that we are using Python’s built-in lower() method for strings.
# Lowercasing
lower_tokens = [t.lower() for t in tokens]
lower_tokens[:5]
['celebrating', '10', 'years', 'with', 'mastercard']
Lemmatization¶
Lemmatization is the process of reducing words to their base or root form, known as the lemma. This helps in normalizing words and reducing the number of unique tokens. For example, “running” becomes “run”, and “tasks” becomes “task”. In this case, we are using spaCy’s built-in lemmatization capabilities to obtain the lemmas of the tokens.
The lemma_ attribute of each token in the Doc object provides the lemmatized form of the token.
# Lemmatization
lemmas = [token.lemma_ for token in doc]
lemmas[:5]
['celebrate', '10', 'year', 'with', 'Mastercard']
Stopword Removal¶
Stopwords are common words that do not contribute significantly to the meaning of the text. Examples include “the”, “is”, “and”, etc. Removing stopwords helps in reducing noise and focusing on the more meaningful words in the text.
spaCy provides a built-in attribute is_stop for each token, which indicates whether the token is a stopword. We can use this attribute to filter out stopwords from our list of tokens.
# Stopword & punctuation removal (lemmatized + lowercased)
clean_tokens = [
token.lemma_.lower()
for token in doc
if not token.is_stop and not token.is_punct and not token.is_space
]
clean_tokens[:5]
['celebrate', '10', 'year', 'mastercard', 'incredible']
print("Original:", original_text)
print("Tokens:", tokens)
print("Lower tokens:", lower_tokens)
print("Lemmas:", lemmas)
print("Clean tokens (no stopwords/punct, lemmatized, lowercased):", clean_tokens)Original: Celebrating 10 years with Mastercard has been an incredible journey - great benefits, flexible hours, and amazing colleagues!
Tokens: ['Celebrating', '10', 'years', 'with', 'Mastercard', 'has', 'been', 'an', 'incredible', 'journey', '-', 'great', 'benefits', ',', 'flexible', 'hours', ',', 'and', 'amazing', 'colleagues', '!']
Lower tokens: ['celebrating', '10', 'years', 'with', 'mastercard', 'has', 'been', 'an', 'incredible', 'journey', '-', 'great', 'benefits', ',', 'flexible', 'hours', ',', 'and', 'amazing', 'colleagues', '!']
Lemmas: ['celebrate', '10', 'year', 'with', 'Mastercard', 'have', 'be', 'an', 'incredible', 'journey', '-', 'great', 'benefit', ',', 'flexible', 'hour', ',', 'and', 'amazing', 'colleague', '!']
Clean tokens (no stopwords/punct, lemmatized, lowercased): ['celebrate', '10', 'year', 'mastercard', 'incredible', 'journey', 'great', 'benefit', 'flexible', 'hour', 'amazing', 'colleague']
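Counter was imported earlier but has not been used yet; on a longer text it gives a quick frequency count of the cleaned tokens. Every token appears exactly once in this single sentence, so the snippet below mainly illustrates the pattern:
# Count how often each cleaned token appears and show the five most frequent
Counter(clean_tokens).most_common(5)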
NLP with spaCy using a DataFrame¶
We can apply the same NLP techniques to a pandas DataFrame containing multiple reviews.
Dataset¶
The dataset contains Glassdoor employee reviews for MasterCard. Each review has a unique review_id along with multiple rating and text fields. We will focus on the text fields, which contain what employees liked or disliked about working at MasterCard.
Some reviews may contain special characters, mixed casing, and stopwords, which we will clean using the NLP techniques mentioned above.
df = pd.read_csv(
"https://raw.githubusercontent.com/bdi475/datasets/refs/heads/main/mastercard-glassdoor-reviews.csv"
)
df.head(3)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 22 columns):
#   Column                            Non-Null Count  Dtype
--  ------                            --------------  -----
0   review_id                         1000 non-null   int64
1   review_date                       1000 non-null   object
2   review_text                       389 non-null    object
3   review_liked_text                 1000 non-null   object
4   review_disliked_text              1000 non-null   object
5   count_helpful                     1000 non-null   int64
6   count_not_helpful                 1000 non-null   int64
7   employer_responses                38 non-null     object
8   is_current_job                    1000 non-null   bool
9   length_of_employment              1000 non-null   int64
10  rating_business_outlook           661 non-null    object
11  rating_career_opportunities       1000 non-null   float64
12  rating_ceo                        685 non-null    object
13  rating_compensation_and_benefits  1000 non-null   float64
14  rating_culture_and_values         1000 non-null   int64
15  rating_diversity_and_inclusion    1000 non-null   int64
16  rating_overall                    1000 non-null   int64
17  rating_recommend_to_friend        714 non-null    object
18  rating_senior_leadership          1000 non-null   float64
19  rating_work_life_balance          1000 non-null   float64
20  job_title_text                    871 non-null    object
21  location_name                     803 non-null    object
dtypes: bool(1), float64(4), int64(7), object(10)
memory usage: 165.2+ KB
There are 1000 reviews in total.
df.shape
(1000, 22)
What did employees like about working at MasterCard?¶
token_lists = []
# Disable the dependency parser ("parser") and named entity recognizer ("ner")
# for speed -- only tokenization and lemmatization are needed here
for doc in nlp.pipe(df["review_liked_text"], disable=["parser", "ner"]):
token_lists.append(
[
token.lemma_.lower()
for token in doc
if not token.is_stop and not token.is_punct and not token.is_space
]
)
df["review_liked_tokens"] = token_lists
df[["review_id", "review_liked_text", "review_liked_tokens"]].head()# Explode liked_tokens so each token becomes its own row
df_liked_exploded = df[["review_id", "review_liked_tokens"]].explode("review_liked_tokens").reset_index(drop=True)
df_liked_exploded = df_liked_exploded.rename(columns={"review_liked_tokens": "token"})
# Inspect result
df_liked_exploded.head()
After processing the “liked” column, we explode the list of tokens so that each token becomes its own row in the DataFrame. We also clean the tokens by dropping any missing or empty tokens and trimming whitespace.
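That cleanup is not shown as a separate cell above; a minimal sketch of one way to do it on the token column created by the rename is:
# Drop rows where the token is missing (reviews whose token list was empty explode to NaN)
df_liked_exploded = df_liked_exploded.dropna(subset=["token"])

# Trim surrounding whitespace and remove tokens that end up empty
df_liked_exploded["token"] = df_liked_exploded["token"].str.strip()
df_liked_exploded = df_liked_exploded[df_liked_exploded["token"] != ""].reset_index(drop=True)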
df_liked_exploded.shape
(12919, 2)
The most common tokens in the “liked” reviews can be identified by counting the occurrences of each token in the exploded DataFrame.
df_liked_common_tokens = df_liked_exploded["token"].value_counts().to_frame().reset_index().head(30)
df_liked_common_tokens
What did employees dislike about working at MasterCard?¶
We can repeat the same process for the disliked column to identify the most common tokens in that column as well.
token_lists = []
# Disable the dependency parser ("parser") and named entity recognizer ("ner")
# for speed -- only tokenization and lemmatization are needed here
for doc in nlp.pipe(df["review_disliked_text"], disable=["parser", "ner"]):
token_lists.append(
[
token.lemma_.lower()
for token in doc
if not token.is_stop and not token.is_punct and not token.is_space
]
)
df["review_disliked_tokens"] = token_lists
df[["review_id", "review_disliked_text", "review_disliked_tokens"]].head()# Explode disliked_tokens so each token becomes its own row
df_disliked_exploded = df[["review_id", "review_disliked_tokens"]].explode("review_disliked_tokens").reset_index(drop=True)
df_disliked_exploded = df_disliked_exploded.rename(columns={"review_disliked_tokens": "token"})
# Inspect result
df_disliked_exploded.head()
df_disliked_common_tokens = df_disliked_exploded["token"].value_counts().to_frame().reset_index().head(30)
df_disliked_common_tokens
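plotly.express was imported at the top of the notebook; a horizontal bar chart is one way to compare these counts visually. The sketch below assumes pandas 2.x column naming, where value_counts().to_frame().reset_index() produces columns named token and count (older pandas versions name them differently). The same pattern works for df_liked_common_tokens.
# Bar chart of the 30 most common "disliked" tokens, longest bar at the top
fig = px.bar(
    df_disliked_common_tokens,
    x="count",
    y="token",
    orientation="h",
    title="Most common tokens in 'disliked' reviews",
)
fig.update_layout(yaxis={"categoryorder": "total ascending"})
fig.show()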