Enalysis

Reputation: 11

How can I make my Python NLTK pre-processing code more efficient?

I am trying to import Yelp reviews and pre-process the text data in Python so I can find the most frequently used nouns in the reviews and, in turn, extract informative aspects. I have come up with the following code and would like to know if there is a more efficient way to write it for this purpose:

    import pandas as pd
    import nltk
    import os

    # Import data files
    path='~\Revsfile'
    filename='blrevs.csv'

    os.chdir(path)
    df1=pd.read_csv(filename, encoding="utf-8") # Set encoding to assist with sent_tokenize command later on
    df2=df1[['id','brand','Rating','Description']] # Description includes the review text

    # Remove missing characters
    df2['Description']=df2['Description'].fillna('')

    # Preprocess text data and tokenize words
    import string
    import nltk
    from nltk.tokenize import RegexpTokenizer
    from nltk.corpus import stopwords
    import re

    stops = set(stopwords.words("english"))

    def preprocess(sentence):
        sentence = sentence.lower()
        tokenizer = RegexpTokenizer(r'\w+')
        tokens = tokenizer.tokenize(sentence)
        filtered_words = [w for w in tokens if not w in stops]
        return " ".join(filtered_words)

    df2['tokenized_words']=df2['Description'].apply(preprocess)

Upvotes: 0

Views: 689

Answers (1)

alexis

Reputation: 50190

The message you see comes up all too often when you work with a dataframe. It means pandas is not sure your operation is safe, but it is not sure that it is a problem, either. Figure out a work-around to be safe, but that's not the source of your performance problem.
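If the message is pandas' SettingWithCopyWarning (a reasonable guess, since fillna() is applied to a slice of df1), one simple work-around is to take an explicit copy of the slice before modifying it, roughly like this:

    # Explicit copy, so the later assignment unambiguously targets df2
    df2 = df1[['id','brand','Rating','Description']].copy()
    df2['Description'] = df2['Description'].fillna('')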

I didn't profile your code, but your clean_up() function in particular is horrible. Why do you keep splitting, processing and re-joining? Tokenize once, filter the tokens, then join the final result if you must.

In addition to the redundant splitting-joining, you do it inefficiently by needlessly building a temporary array that you pass to join(). Use a generator instead (i.e., leave out the square brackets) and your performance should improve dramatically. For example, instead of ''.join([singularize(plural) for plural in s]) you can write:

    s = ''.join(singularize(plural) for plural in s)

I can't go into more detail because, to be frank, your tokenization is a mess. When and how will you apply sent_tokenize(), after you've removed the punctuation? Also, the line I rewrote above is (and was) trying to "singularize" individual letters, if I'm not mistaken. Think more carefully about what you're doing, work with tokens as I recommended (consider using nltk.word_tokenize() -- but it's not as fast as a single split()), and inspect the intermediate steps.
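To make the "tokenize once, filter, then join" idea concrete, here is a rough sketch of what preprocess() could look like with nltk.word_tokenize(); the isalpha() check is only a stand-in for your punctuation handling (it also drops numbers), so adapt it as needed:

    from nltk import word_tokenize
    from nltk.corpus import stopwords

    # Requires nltk.download('punkt') and nltk.download('stopwords') the first time
    stops = set(stopwords.words("english"))

    def preprocess(text):
        # Tokenize once, then filter; join only at the very end
        tokens = word_tokenize(text.lower())
        # isalpha() stands in for punctuation removal -- it also drops numbers
        return " ".join(t for t in tokens if t.isalpha() and t not in stops)

    df2['tokenized_words'] = df2['Description'].apply(preprocess)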

Upvotes: 1

Related Questions