Reputation: 11
I am trying to import Yelp reviews and pre-process the text data using Python so I can find the most frequently used nouns in the reviews and, in turn, extract informative aspects. I have come up with the following code and wanted someone to let me know if there is a more efficient way to write it for this purpose:
import pandas as pd
import nltk
import os
# Import data files
path='~\Revsfile'
filename='blrevs.csv'
os.chdir(path)
df1=pd.read_csv(filename, encoding="utf-8") # Set encoding to assist with sent_tokenize command later on
df2=df1[['id','brand','Rating','Description']] # Description includes the review text
# Replace missing values with empty strings
df2['Description']=df2['Description'].fillna('')
# Preprocess text data and tokenize words
import string
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re
stops = set(stopwords.words("english"))
def preprocess(sentence):
    sentence = sentence.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(sentence)
    filtered_words = [w for w in tokens if w not in stops]
    return " ".join(filtered_words)
df2['tokenized_words']=df2['Description'].apply(preprocess)
Upvotes: 0
Views: 689
Reputation: 50190
The message you see (pandas' SettingWithCopyWarning) comes up all too often when you work with a dataframe. It means pandas is not sure whether your operation is safe, but it is not sure that it is a problem, either. Figure out a work-around to be safe, but that's not the source of your performance problem.
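One common work-around (assuming the warning is triggered by the column-subset slice in your question) is to take an explicit copy before assigning into it:
df2 = df1[['id', 'brand', 'Rating', 'Description']].copy()  # explicit copy, not a view of df1
df2['Description'] = df2['Description'].fillna('')  # this assignment now touches df2 only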
I didn't profile your code, but your clean_up() function in particular is horrible. Why do you keep splitting, processing and re-joining? Tokenize once, filter the tokens, then join the final result if you must.
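A rough sketch of that shape, reusing the names from your preprocess() function (building the RegexpTokenizer once, outside the per-review call, is an extra tweak of mine):
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

stops = set(stopwords.words("english"))
tokenizer = RegexpTokenizer(r'\w+')  # build the tokenizer once, not per review

def preprocess(sentence):
    # tokenize once, drop stopwords, join once at the end
    tokens = tokenizer.tokenize(sentence.lower())
    return " ".join(w for w in tokens if w not in stops)

df2['tokenized_words'] = df2['Description'].apply(preprocess)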
In addition to the redundant splitting and re-joining, you do it inefficiently by needlessly building a temporary list that you pass to join(). Use a generator instead (i.e., leave out the square brackets) and your performance should improve dramatically. For example, instead of ''.join([singularize(plural) for plural in s]) you can write:
s = ''.join(singularize(plural) for plural in s)
I can't go into more detail because, to be frank, your tokenization is a mess. When and how will you apply sent_tokenize(), after you've removed the punctuation? Also, the line I rewrote above is (and was) trying to "singularize" individual letters, if I'm not mistaken. Think more carefully about what you're doing, work with tokens as I recommended (consider using nltk.word_tokenize() -- but it's not as fast as a single split()), and inspect the intermediate steps.
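If you do want sentence boundaries for the later aspect extraction, one possible ordering (just a sketch; the sentence_tokens column name is only for illustration) is to sentence-tokenize while the punctuation is still intact, then word-tokenize and filter each sentence:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

stops = set(stopwords.words("english"))

def tokens_by_sentence(text):
    # sentence-split first, while punctuation is still there, then word-tokenize
    # each sentence and drop punctuation tokens and stopwords
    return [[w for w in word_tokenize(sent.lower()) if w.isalpha() and w not in stops]
            for sent in sent_tokenize(text)]

df2['sentence_tokens'] = df2['Description'].apply(tokens_by_sentence)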
Upvotes: 1