I have a pandas dataframe with several hundred thousand rows and a column df['reviews'] within which are text reviews of a product. I am cleaning the data, but pre-processing is taking a long time. Could you please offer suggestions on how to optimize my code? Thanks in advance.
# import useful libraries
import pandas as pd
from langdetect import detect
import nltk
from html2text import unescape
from nltk.corpus import stopwords
# define corpus
words = set(nltk.corpus.words.words())
# define stopwords
stop = stopwords.words('english')
newStopWords = ['oz','stopWord2']
stop.extend(newStopWords)
# read csv into dataframe
df=pd.read_csv('./data.csv')
# unescape reviews (fix html encoding)
df['clean_reviews'] = df['reviews'].apply(unescape, unicode_snob=True)
# remove non-ASCII characters
df['clean_reviews'] = df["clean_reviews"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
# calculate number of stop words in raw reviews
df['stopwords'] = df['reviews'].apply(lambda x: len([x for x in x.split() if x in stop]))
# lowercase reviews
df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x.lower() for x in x.split()))
# add a space before and after every punctuation
df['clean_reviews'] = df['clean_reviews'].str.replace(r'([^\w\s]+)', ' \\1 ')
# remove punctuation
df['clean_reviews'] = df['clean_reviews'].str.replace('[^\w\s]','')
# remove stopwords
df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
# remove digits
df['clean_reviews'] = df['clean_reviews'].str.replace('\d+', '')
# remove non-corpus words
def remove_noncorpus(sentence):
    print(sentence)
    return " ".join(w for w in nltk.wordpunct_tokenize(sentence) if w.lower() in words or not w.isalpha())
df['clean_reviews'] = df['clean_reviews'].map(remove_noncorpus)
# count number of characters
df['character_count'] = df['clean_reviews'].apply(len)
# count number of words
df['word_count'] = df['clean_reviews'].str.split().str.len()
# average word length
def avg_word(sentence):
    words = sentence.split()
    print(sentence)
    return (sum(len(word) for word in words)/len(words))
df['avg_word'] = df['clean_reviews'].apply(lambda x: avg_word(x))
df[['clean_reviews','avg_word']].head()
# detect language of reviews
df['language'] = df['clean_reviews'].apply(detect)
# filter out non-English reviews
msk = (df['language'] == 'en')
df_range = df[msk]
# write dataframe to csv
df_range.to_csv('dataclean.csv', index=False)
The code posted above does everything that I need it to; however, it takes hours to finish. I would appreciate any helpful suggestions on how to cut down the processing time. Please let me know if you need any other details.
First you'll have to see where most of the time is spent in your program. This can be done 'manually', as already noted in the comments above, by inserting print() calls after each step to give you a visual impression of the program's progress. To get quantitative results you could wrap each step in start = time.time() and print('myProgramStep: {}'.format(time.time() - start)) calls. This is OK as long as your program is relatively short, otherwise it becomes rather arduous.
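A minimal sketch of that manual timing, reusing two of the steps from your script:

import time

start = time.time()
df['clean_reviews'] = df['reviews'].apply(unescape, unicode_snob=True)
print('unescape step: {:.2f} s'.format(time.time() - start))

start = time.time()
df['language'] = df['clean_reviews'].apply(detect)
print('language detection step: {:.2f} s'.format(time.time() - start))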
The best way is to use a profiler. Python comes with a built-in profiler, but it's a bit cumbersome to use. First we profile the program with cProfile and then load the profile for review with pstats:
python3 -m cProfile -o so57333255.py.prof so57333255.py
python3 -m pstats so57333255.py.prof
Inside pstats we enter sort cumtime to sort by the time spent in a function and all functions called by it, and stats 5 to show the top 5 entries:
2351652 function calls (2335973 primitive calls) in 9.843 seconds
Ordered by: cumulative time
List reduced from 4964 to 5 due to restriction <5>
ncalls tottime percall cumtime percall filename:lineno(function)
1373/1 0.145 0.000 9.852 9.852 {built-in method exec}
1 0.079 0.079 9.852 9.852 so57333255.py:2(<module>)
9 0.003 0.000 5.592 0.621 {pandas._libs.lib.map_infer}
8 0.001 0.000 5.582 0.698 /usr/local/lib/python3.4/dist-packages/pandas/core/series.py:2230(apply)
100 0.001 0.000 5.341 0.053 /usr/local/lib/python3.4/dist-packages/langdetect/detector_factory.py:126(detect)
From here we learn that the most expensive single function in your program is apply, called 8 times - but we can't tell from this whether the 8 calls each took roughly the same amount of time or whether one took especially long. On the next line, however, we see detect with 5.341 s, i.e. most of the total 5.582 s for all 8 apply calls was spent on apply(detect). You can get further insights with the callers and callees commands, but as you can see it is not very convenient.
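If you prefer, the same profile can also be inspected programmatically rather than through the interactive pstats prompt; a small sketch using the .prof file produced above:

import pstats

stats = pstats.Stats('so57333255.py.prof')
stats.sort_stats('cumtime').print_stats(5)   # same as 'sort cumtime' followed by 'stats 5'
stats.print_callers('detect')                # shows which functions call detect and how often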
A much more user-friendly approach is line_profiler. It profiles functions decorated with @profile, so we have to put the whole program into a function carrying that decorator and then call this function, roughly as sketched below.
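The wrapping could look like this (a sketch only; it assumes line_profiler is installed and the script is named so57333255a.py, as in the output below):

@profile                 # kernprof injects this decorator at run time, no import needed
def runit():
    # ... all the preprocessing steps from the question go here, unchanged ...
    df['language'] = df['clean_reviews'].apply(detect)

runit()

Running it with kernprof -l -v so57333255a.py then gives the following result: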
Total time: 8.59578 s
File: so57333255a.py
Function: runit at line 8
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def runit():
10
11 # define corpus
12 1 385710.0 385710.0 4.5 words = set(nltk.corpus.words.words())
13
14 # define stopwords
15 1 2068.0 2068.0 0.0 stop = stopwords.words('english')
16 1 10.0 10.0 0.0 newStopWords = ['oz','stopWord2']
17 1 9.0 9.0 0.0 stop.extend(newStopWords)
18
19 # read csv into dataframe
20 1 46880.0 46880.0 0.5 df=pd.read_csv('reviews.csv', names=['reviews'], header=None, nrows=100)
21
22 # unescape reviews (fix html encoding)
23 1 16922.0 16922.0 0.2 df['clean_reviews'] = df['reviews'].apply(unescape, unicode_snob=True)
24
25 # remove non-ASCII characters
26 1 15133.0 15133.0 0.2 df['clean_reviews'] = df["clean_reviews"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
27
28 # calculate number of stop words in raw reviews
29 1 20721.0 20721.0 0.2 df['stopwords'] = df['reviews'].apply(lambda x: len([x for x in x.split() if x in stop]))
30
31 # lowercase reviews
32 1 5325.0 5325.0 0.1 df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x.lower() for x in x.split()))
33
34 # add a space before and after every punctuation
35 1 9834.0 9834.0 0.1 df['clean_reviews'] = df['clean_reviews'].str.replace(r'([^\w\s]+)', ' \\1 ')
36
37 # remove punctuation
38 1 3262.0 3262.0 0.0 df['clean_reviews'] = df['clean_reviews'].str.replace('[^\w\s]','')
39
40 # remove stopwords
41 1 20259.0 20259.0 0.2 df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
42
43 # remove digits
44 1 2897.0 2897.0 0.0 df['clean_reviews'] = df['clean_reviews'].str.replace('\d+', '')
45
46 # remove non-corpus words
47 1 9.0 9.0 0.0 def remove_noncorpus(sentence):
48 #print(sentence)
49 return " ".join(w for w in nltk.wordpunct_tokenize(sentence) if w.lower() in words or not w.isalpha())
50
51 1 6698.0 6698.0 0.1 df['clean_reviews'] = df['clean_reviews'].map(remove_noncorpus)
52
53 # count number of characters
54 1 1912.0 1912.0 0.0 df['character_count'] = df['clean_reviews'].apply(len)
55
56 # count number of words
57 1 3641.0 3641.0 0.0 df['word_count'] = df['clean_reviews'].str.split().str.len()
58
59 # average word length
60 1 9.0 9.0 0.0 def avg_word(sentence):
61 words = sentence.split()
62 #print(sentence)
63 return (sum(len(word) for word in words)/len(words)) if len(words)>0 else 0
64
65 1 3445.0 3445.0 0.0 df['avg_word'] = df['clean_reviews'].apply(lambda x: avg_word(x))
66 1 3786.0 3786.0 0.0 df[['clean_reviews','avg_word']].head()
67
68 # detect language of reviews
69 1 8037362.0 8037362.0 93.5 df['language'] = df['clean_reviews'].apply(detect)
70
71 # filter out non-English reviews
72 1 1453.0 1453.0 0.0 msk = (df['language'] == 'en')
73 1 2353.0 2353.0 0.0 df_range = df[msk]
74
75 # write dataframe to csv
76 1 6087.0 6087.0 0.1 df_range.to_csv('dataclean.csv', index=False)
From here we see directly that 93.5 % of the total time is spent on df['language'] = df['clean_reviews'].apply(detect). This is for my toy example with just 100 rows; for 5K rows it will be over 99 % of the time.
So most of the time is spent on language detection. Details of the algorithm used by detect can be found here. It turns out that about 40 to 50 characters of a text are sufficient to determine the language, so if your reviews are much longer, you can save some time by applying detect not to the whole text but just to the first 50 characters. Depending on the average length of your reviews this will bring a speed-up of a couple of percent.
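For example, a sketch of that idea (it assumes your cleaned reviews are usually longer than 50 characters):

# run language detection on the first 50 characters of each cleaned review only
df['language'] = df['clean_reviews'].str[:50].apply(detect)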
As there's not much to optimize within the detect function itself, the only way is to replace it with something faster, e.g. Google's Compact Language Detector CLD2 or CLD3. I went for the latter and it turned out to be about 100 times faster than detect. Another fast alternative is langid; its speed is compared to CLD2 in this paper.
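A minimal sketch of that swap, assuming the pycld3 bindings (other CLD3 bindings expose a slightly different API):

import cld3

def detect_cld3(text):
    # get_language returns a prediction with .language and .is_reliable attributes
    pred = cld3.get_language(text)
    return pred.language if pred is not None and pred.is_reliable else 'unknown'

df['language'] = df['clean_reviews'].apply(detect_cld3)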