In today’s post, I am going to go over various text preprocessing techniques, such as stopword removal and lemmatization, using the NLTK and Keras Python packages.
What is text preprocessing?
Text preprocessing is an extremely important technique when dealing with Natural Language Processing problems. Preprocessing is used to prepare a block of text for a machine learning algorithm by removing unnecessary elements.
Take the following sentence as an example: “The house I am living at is at the address 123 My-Street Avenue in New York City.”
Throughout this blog post, we are going to perform different preprocessing techniques on this sentence.
Removing unnecessary characters
First, we want to remove all of the unnecessary characters in the text, which includes punctuation and numbers, and convert everything to lowercase. This is done so that the machine learning algorithm does not give extra importance to a word simply because of its case (for example, treating “The” and “the” as different words). The following code can be used to clean up our sentence.
Code:
import re

text = 'The house I am living at is at the address 123 My-Street Avenue in New York City.'
text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
text = re.sub(r'\d+', '', text)      # remove numbers
text = text.lower()                  # change text to lowercase
text = re.sub(r'\s+', ' ', text)     # collapse excess whitespace
print(text)
Output:
'the house i am living at is at the address mystreet avenue in new york city'
We successfully removed the numbers and punctuation and converted the sentence to lowercase.
Tokenization
Tokenization is the process of splitting text into smaller units called tokens, which are usually words but can also be characters or subword sequences. By splitting the sentence into tokens, we can find patterns in the words, and we open the door for other preprocessing techniques such as lemmatization and stemming.
Let’s take a look at how to use the Natural Language Toolkit (NLTK) Python package to tokenize our text.
Code:
import nltk
nltk.download('punkt')  # tokenizer model data needed by word_tokenize (newer NLTK versions may need 'punkt_tab')

from nltk.tokenize import word_tokenize
words = word_tokenize(text)
print(words)
Output:
['the', 'house', 'i', 'am', 'living', 'at', 'is', 'at', 'the', 'address', 'mystreet', 'avenue', 'in', 'new', 'york', 'city']
We now have a list of all of the words in our original sentence, courtesy of the NLTK package.
The Keras package also provides a Tokenizer class with some useful methods. We can call fit_on_texts() to build a vocabulary from the text and then texts_to_sequences() to replace each word with its unique number, turning our string of words into an array of numbers. Let’s take a look at the code.
Code:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
print('Dictionary: ', tokenizer.word_index)
print('Sequence: ', tokenizer.texts_to_sequences([text]))
Output:
Dictionary: {'the': 1, 'at': 2, 'house': 3, 'i': 4, 'am': 5, 'living': 6, 'is': 7, 'address': 8, 'mystreet': 9, 'avenue': 10, 'in': 11, 'new': 12, 'york': 13, 'city': 14}
Sequence: [[1, 3, 4, 5, 6, 2, 7, 2, 1, 8, 9, 10, 11, 12, 13, 14]]
Removing stopwords
Stopwords are words that are commonly used in a language but do not add any significant meaning to a sentence. For example, the word “the” is important when speaking or writing English; however, if the word was removed from the sentence, the meaning of the sentence would not be altered in a major way. Therefore, we are going to remove the stopwords from our sentence using the NLTK package again.
Code:
import nltk
nltk.download('stopwords')  # download the stopword lists (first run only)

from nltk.corpus import stopwords

print('original: ', words)
stop_words = set(stopwords.words('english'))  # build the set once instead of inside the loop
words = [word for word in words if word not in stop_words]
print('without stopwords: ', words)
Output:
original: ['the', 'house', 'i', 'am', 'living', 'at', 'is', 'at', 'the', 'address', 'mystreet', 'avenue', 'in', 'new', 'york', 'city']
without stopwords: ['house', 'living', 'address', 'mystreet', 'avenue', 'new', 'york', 'city']
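If you are curious about which words count as stopwords, you can inspect NLTK’s list directly. This is a minimal sketch; the exact contents and length of the list vary between NLTK versions.
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
print(len(stop_words))   # around 180 entries, depending on the NLTK version
print(stop_words[:10])   # short function words such as 'i', 'me', 'my', 'myself', 'we', ...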
Stemming and lemmatization
Stemming and lemmatization are two techniques for reducing words to a base form. Stemming strips word endings using heuristic rules, so the resulting stem is not always a real word, while lemmatization reduces a word to its dictionary root form, called its lemma. For example, take the word “studies.” Stemming reduces it to the stem “studi,” which is not an actual English word, whereas lemmatization produces the lemma “study.” The key difference is that the lemma is the true root form shared by all inflections of a word: lemmatizing “studying” and “studies” both gives “study,” while a stemmer simply truncates them to stems such as “studi.”
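Here is a minimal sketch of that example using NLTK’s PorterStemmer and WordNetLemmatizer (the exact stems can vary slightly between stemmer implementations and NLTK versions):
import nltk
nltk.download('wordnet')  # WordNet data needed by the lemmatizer (first run only)

from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['studies', 'studying']:
    # stem() applies suffix-stripping rules; lemmatize() looks the word up in WordNet as a verb
    print(word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word, 'v'))

# Expected output with the Porter stemmer:
# studies -> studi | study
# studying -> studi | study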
Stemming is computationally much cheaper than lemmatization, because lemmatization relies on dictionary lookups (and usually part-of-speech information) to find the correct lemma. If you are working with a large dataset under time and resource constraints, stemming is a reasonable choice; if accuracy matters more to you, lemmatization is the way to go.
Let’s look at the code for these two techniques.
Code:
import nltk
nltk.download('wordnet')                      # WordNet data for the lemmatizer
nltk.download('averaged_perceptron_tagger')   # tagger model used by pos_tag (newer NLTK versions may use 'averaged_perceptron_tagger_eng')

from nltk import pos_tag
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

# Stemming
stemmed_words = [PorterStemmer().stem(word) for word in words]
print('Stemmed Words: ', stemmed_words)

# Lemmatization (needs a part-of-speech tag for each word)
lemmatizer = WordNetLemmatizer()
parts_of_speech = pos_tag(words)
lemmas = []
for word, tag in parts_of_speech:
    tag = tag[0].lower()   # first letter of the Penn Treebank tag, e.g. 'NN' -> 'n'
    if tag == 'j':         # pos_tag marks adjectives as 'JJ'; WordNet expects 'a'
        tag = 'a'
    if tag not in ['a', 'r', 'n', 'v']:
        tag = None
    if not tag:
        lemma = word       # no usable tag, keep the word as-is
    else:
        lemma = lemmatizer.lemmatize(word, tag)
    lemmas.append(lemma)
print('Lemmatized Words: ', lemmas)
Output:
Stemmed Words: ['hous', 'live', 'address', 'mystreet', 'avenu', 'new', 'york', 'citi']
Lemmatized Words: ['house', 'live', 'address', 'mystreet', 'avenue', 'new', 'york', 'city']
For the lemmatizer, we need to get the associated part-of-speech tag for each word because the lemma changes based on how the word is being used. That is why we imported the pos_tag function from NLTK.
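You can see this for yourself with a minimal sketch: without a part-of-speech tag, the WordNet lemmatizer treats every word as a noun by default.
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('living'))       # 'living' -- treated as a noun by default
print(lemmatizer.lemmatize('living', 'v'))  # 'live'   -- treated as a verb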
We can see that the stemming process reduced the words “house” to “hous,” “living” to “live,” “avenue” to “avenu,” and “city” to “citi.” On the other hand, lemmatization only reduced the word “living” to “live.”
Summary
Text preprocessing is an extremely important part of the NLP pipeline because it ensures that the data passed to machine learning algorithms or neural networks is clean. The techniques I covered in this blog post are the most commonly used ones, but there are plenty of other techniques out there for you to explore.