I have a pandas dataframe with several hundred thousand rows and a column df['reviews'] within which are text reviews of a product. I am cleaning the data, but pre-processing is taking a long time. Could you please offer suggestions on how to optimize my code? Thanks in advance.
# import useful libraries
import pandas as pd
from langdetect import detect
import nltk
from html2text import unescape
from nltk.corpus import stopwords
# define corpus
words = set(nltk.corpus.words.words())
# define stopwords
stop = stopwords.words('english')
newStopWords = ['oz','stopWord2']
stop.extend(newStopWords)
# read csv into dataframe
df=pd.read_csv('./data.csv')
# unescape reviews (fix html encoding)
df['clean_reviews'] = df['reviews'].apply(unescape, unicode_snob=True)
# remove non-ASCII characters
df['clean_reviews'] = df["clean_reviews"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
# calculate number of stop words in raw reviews
df['stopwords'] = df['reviews'].apply(lambda x: len([x for x in x.split() if x in stop]))
# lowercase reviews
df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x.lower() for x in x.split()))
# add a space before and after every punctuation
df['clean_reviews'] = df['clean_reviews'].str.replace(r'([^\w\s]+)', ' \\1 ')
# remove punctuation
df['clean_reviews'] = df['clean_reviews'].str.replace('[^\w\s]','')
# remove stopwords
df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
# remove digits
df['clean_reviews'] = df['clean_reviews'].str.replace('\d+', '')
# remove non-corpus words
def remove_noncorpus(sentence):
    print(sentence)
    return " ".join(w for w in nltk.wordpunct_tokenize(sentence) if w.lower() in words or not w.isalpha())
df['clean_reviews'] = df['clean_reviews'].map(remove_noncorpus)
# count number of characters
df['character_count'] = df['clean_reviews'].apply(len)
# count number of words
df['word_count'] = df['clean_reviews'].str.split().str.len()
# average word length
def avg_word(sentence):
    words = sentence.split()
    print(sentence)
    return (sum(len(word) for word in words)/len(words))
df['avg_word'] = df['clean_reviews'].apply(lambda x: avg_word(x))
df[['clean_reviews','avg_word']].head()
# detect language of reviews
df['language'] = df['clean_reviews'].apply(detect)
# filter out non-English reviews
msk = (df['language'] == 'en')
df_range = df[msk]
# write dataframe to csv
df_range.to_csv('dataclean.csv', index=False)
The code posted above does everything that I need it to; however, it takes hours to finish. I would appreciate any helpful suggestions on how to cut down the processing time. Please let me know if you need any other details.
First you'll have to see where most of the time is spent in your program. This can be done 'manually', as already noted in the comments above, by inserting print() calls after each step to give you a visual impression of the program's progress. To get quantitative results you could wrap each step in start = time.time() and print('myProgramStep: {}'.format(time.time() - start)) calls. This is OK as long as your program is relatively short, otherwise it becomes rather arduous.
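A minimal sketch of that manual timing, reusing two of the steps from your script:

import time

start = time.time()
df['clean_reviews'] = df['reviews'].apply(unescape, unicode_snob=True)
print('unescape step: {:.2f} s'.format(time.time() - start))

start = time.time()
df['language'] = df['clean_reviews'].apply(detect)
print('language detection step: {:.2f} s'.format(time.time() - start))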
The best way is to use a profiler. Python comes with a built-in profiler, but it's a bit cumbersome to use. First we profile the program with cProfile and then load the profile for review with pstats:
python3 -m cProfile -o so57333255.py.prof so57333255.py
python3 -m pstats so57333255.py.prof
Inside pstats we enter sort cumtime to sort by the time spent in a function and all functions called by it, and stats 5 to show the top 5 entries:
2351652 function calls (2335973 primitive calls) in 9.843 seconds
Ordered by: cumulative time
List reduced from 4964 to 5 due to restriction <5>
ncalls tottime percall cumtime percall filename:lineno(function)
1373/1 0.145 0.000 9.852 9.852 {built-in method exec}
1 0.079 0.079 9.852 9.852 so57333255.py:2(<module>)
9 0.003 0.000 5.592 0.621 {pandas._libs.lib.map_infer}
8 0.001 0.000 5.582 0.698 /usr/local/lib/python3.4/dist-packages/pandas/core/series.py:2230(apply)
100 0.001 0.000 5.341 0.053 /usr/local/lib/python3.4/dist-packages/langdetect/detector_factory.py:126(detect)
From here we learn that the most expensive single function in your program is apply, called 8 times - but we can't tell from this whether the 8 calls each took roughly the same amount of time or whether one took especially long. On the next line, however, we see detect with 5.341 s, i.e. most of the total 5.582 s for all 8 apply calls was spent on apply(detect). You can get further insights with the callers and callees commands, but as you can see it is not very convenient.
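If you prefer, the same profile can also be inspected programmatically rather than through the interactive pstats prompt; a small sketch using the .prof file produced above:

import pstats

stats = pstats.Stats('so57333255.py.prof')
stats.sort_stats('cumtime').print_stats(5)   # same as 'sort cumtime' followed by 'stats 5'
stats.print_callers('detect')                # shows which functions call detect and how often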
A much more user-friendly approach is line_profiler. It profiles functions decorated with @profile, so we have to put the whole program into a function carrying that decorator and then call this function, roughly as sketched below.
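The wrapping could look like this (a sketch only; it assumes line_profiler is installed and the script is named so57333255a.py, as in the output below):

@profile                 # kernprof injects this decorator at run time, no import needed
def runit():
    # ... all the preprocessing steps from the question go here, unchanged ...
    df['language'] = df['clean_reviews'].apply(detect)

runit()

Running it with kernprof -l -v so57333255a.py then gives the following result: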
Total time: 8.59578 s
File: so57333255a.py
Function: runit at line 8
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def runit():
10
11 # define corpus
12 1 385710.0 385710.0 4.5 words = set(nltk.corpus.words.words())
13
14 # define stopwords
15 1 2068.0 2068.0 0.0 stop = stopwords.words('english')
16 1 10.0 10.0 0.0 newStopWords = ['oz','stopWord2']
17 1 9.0 9.0 0.0 stop.extend(newStopWords)
18
19 # read csv into dataframe
20 1 46880.0 46880.0 0.5 df=pd.read_csv('reviews.csv', names=['reviews'], header=None, nrows=100)
21
22 # unescape reviews (fix html encoding)
23 1 16922.0 16922.0 0.2 df['clean_reviews'] = df['reviews'].apply(unescape, unicode_snob=True)
24
25 # remove non-ASCII characters
26 1 15133.0 15133.0 0.2 df['clean_reviews'] = df["clean_reviews"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
27
28 # calculate number of stop words in raw reviews
29 1 20721.0 20721.0 0.2 df['stopwords'] = df['reviews'].apply(lambda x: len([x for x in x.split() if x in stop]))
30
31 # lowercase reviews
32 1 5325.0 5325.0 0.1 df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x.lower() for x in x.split()))
33
34 # add a space before and after every punctuation
35 1 9834.0 9834.0 0.1 df['clean_reviews'] = df['clean_reviews'].str.replace(r'([^\w\s]+)', ' \\1 ')
36
37 # remove punctuation
38 1 3262.0 3262.0 0.0 df['clean_reviews'] = df['clean_reviews'].str.replace('[^\w\s]','')
39
40 # remove stopwords
41 1 20259.0 20259.0 0.2 df['clean_reviews'] = df['clean_reviews'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
42
43 # remove digits
44 1 2897.0 2897.0 0.0 df['clean_reviews'] = df['clean_reviews'].str.replace('\d+', '')
45
46 # remove non-corpus words
47 1 9.0 9.0 0.0 def remove_noncorpus(sentence):
48 #print(sentence)
49 return " ".join(w for w in nltk.wordpunct_tokenize(sentence) if w.lower() in words or not w.isalpha())
50
51 1 6698.0 6698.0 0.1 df['clean_reviews'] = df['clean_reviews'].map(remove_noncorpus)
52
53 # count number of characters
54 1 1912.0 1912.0 0.0 df['character_count'] = df['clean_reviews'].apply(len)
55
56 # count number of words
57 1 3641.0 3641.0 0.0 df['word_count'] = df['clean_reviews'].str.split().str.len()
58
59 # average word length
60 1 9.0 9.0 0.0 def avg_word(sentence):
61 words = sentence.split()
62 #print(sentence)
63 return (sum(len(word) for word in words)/len(words)) if len(words)>0 else 0
64
65 1 3445.0 3445.0 0.0 df['avg_word'] = df['clean_reviews'].apply(lambda x: avg_word(x))
66 1 3786.0 3786.0 0.0 df[['clean_reviews','avg_word']].head()
67
68 # detect language of reviews
69 1 8037362.0 8037362.0 93.5 df['language'] = df['clean_reviews'].apply(detect)
70
71 # filter out non-English reviews
72 1 1453.0 1453.0 0.0 msk = (df['language'] == 'en')
73 1 2353.0 2353.0 0.0 df_range = df[msk]
74
75 # write dataframe to csv
76 1 6087.0 6087.0 0.1 df_range.to_csv('dataclean.csv', index=False)
From here we see directly that 93.5 % of the total time is spent on df['language'] = df['clean_reviews'].apply(detect). This is for my toy example with just 100 rows; for 5K rows it will be over 99 % of the time.
So most of the time is spent on language detection. Details of the algorithm used by detect can be found here. It turns out that about 40 to 50 characters of a text are sufficient to determine the language, so if your reviews are much longer, you can save some time by applying detect not to the whole text but just to the first 50 characters. Depending on the average length of your reviews this will bring a speed-up of a couple of percent.
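For example, a sketch of that idea (it assumes your cleaned reviews are usually longer than 50 characters):

# run language detection on the first 50 characters of each cleaned review only
df['language'] = df['clean_reviews'].str[:50].apply(detect)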
As there's not much to optimize within the detect function itself, the only way is to replace it with something faster, e.g. Google's Compact Language Detector CLD2 or CLD3. I went for the latter and it turned out to be about 100 times faster than detect. Another fast alternative is langid; its speed is compared to CLD2 in this paper.
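A minimal sketch of that swap, assuming the pycld3 bindings (other CLD3 bindings expose a slightly different API):

import cld3

def detect_cld3(text):
    # get_language returns a prediction with .language and .is_reliable attributes
    pred = cld3.get_language(text)
    return pred.language if pred is not None and pred.is_reliable else 'unknown'

df['language'] = df['clean_reviews'].apply(detect_cld3)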