TAN-C-F-OK

Reputation: 179

Speed up removing stopwords from huge csv-file

Is there a better (faster) way to remove stopwords from a csv file?

Here is the simple code, and over an hour later I am still waiting for the results (so I don't even know if it's actually working):

import nltk
from nltk.corpus import stopwords
import csv
import codecs

f = codecs.open("agenericcsvfile.csv","r","utf-8")
readit = f.read()
f.close()

filtered = [w for w in readit if not w in stopwords.words('english')]

The csv-file has 50,000 rows and a total of ~15 million words. Why does it take so long? Sadly, this is only a subcorpus; I will have to do the same with over 1 million rows and over 300 million words. So is there a way to speed things up? Or more elegant code?

CSV-file sample:

1 text,sentiment
2 Loosely based on The Decameron, Jeff Baena's subversive film takes us behind the walls of a 13th century convent and squarely in the midst of a trio of lustful sisters, Alessandra (Alison Brie), Fernanda (Aubrey Plaza), and Ginerva (Kate Micucci) who are "beguiled" by a new handyman, Massetto (Dave Franco). He is posing as a deaf [...] and it is coming undone from all of these farcical complications.,3
3 One might recommend this film to the most liberally-minded of individuals, but even that is questionable as [...] But if you are one of the ribald loving few, who likes their raunchy hi-jinks with a satirical sting, this is your kinda movie. For me, the satire was lost.,5
4 [...]
[...]
50.000 The movie is [...] tht is what I ahve to say.,9

The desired output would be the same csv-file without the stop-words.

Upvotes: 1

Views: 1535

Answers (2)

tobias_k

Reputation: 82929

It seems like the stop words returned by NLTK are a list, thus having O(n) lookup. Convert the list to a set first, then it will be much faster.

>>> some_word = "aren't"
>>> stop = stopwords.words('english')
>>> type(stop)
list
>>> %timeit some_word in stop
1000000 loops, best of 3: 1.3 µs per loop

>>> stop = set(stopwords.words('english'))
>>> %timeit some_word in stop
10000000 loops, best of 3: 43.8 ns per loop

However, while this should solve the performance problem, it seems like your code is not doing what you expect it to do in the first place. readit is a single string holding the content of the entire file, so you are iterating over characters, not words. You import the csv module, but you never use it. Also, the strings in your csv file would have to be quoted, otherwise it will split at every comma, not just the last one. If you can not change the csv file, it might be easier to use str.rsplit, though:

stopwords_set = set(stopwords.words('english'))

texts = [line.rsplit(",", 1)[0] for line in readit.splitlines()]
filtered = [[w for w in text.split() if w.lower() not in stopwords_set]
            for text in texts]
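
Putting both points together, a rough sketch of the whole pipeline (keeping the header and the sentiment column, and writing the result back out; the output file name is just a placeholder):

stopwords_set = set(stopwords.words('english'))   # set, not list, for fast lookup

with codecs.open("agenericcsvfile.csv", "r", "utf-8") as f:
    lines = f.read().splitlines()

header, rows = lines[0], lines[1:]

cleaned = [header]
for line in rows:
    text, sentiment = line.rsplit(",", 1)          # split only at the last comma
    kept = " ".join(w for w in text.split() if w.lower() not in stopwords_set)
    cleaned.append(kept + "," + sentiment)

with codecs.open("agenericcsvfile_filtered.csv", "w", "utf-8") as out:
    out.write("\n".join(cleaned) + "\n")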

Upvotes: 1

bruno desthuilliers

Reputation: 77912

The first obvious optimization would be to 1/ avoid calling stopwords.words() on each iteration and 2/ make it a set (set lookup is O(1), while list lookup is O(N)):

words = set(stopwords.words("english"))
filtered = [w for w in readit if not w in words]

but this will not yield the expected results, since readit is a string, so you are actually iterating over individual characters, not words. You need to tokenize your string first, e.g. with nltk.word_tokenize:

from nltk.tokenize import word_tokenize
readit = word_tokenize(readit)
# now readit is a proper list of words...
filtered = [w for w in readit if not w in words]

But now you have lost all the csv newlines, so you cannot properly rebuild the file... and you might run into quoting issues too, if there is any quoting in your csv. So you may actually want to parse your source properly with a csv.reader and clean up your data field by field, row by row, which will of course add quite some overhead. That is, if your goal is to rebuild the csv without the stopwords (else you may not care that much).
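
A minimal sketch of that csv.reader/csv.writer approach (Python 3; it assumes the file is valid, properly quoted csv with the text in the first column, and the output file name is just a placeholder):

import csv
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

words = set(stopwords.words("english"))

# read and rewrite the csv row by row, filtering only the text field
with open("agenericcsvfile.csv", newline="", encoding="utf-8") as src, \
     open("agenericcsvfile_filtered.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))                   # copy the header unchanged
    for row in reader:
        text, rest = row[0], row[1:]
        kept = [w for w in word_tokenize(text) if w.lower() not in words]
        writer.writerow([" ".join(kept)] + rest)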

Anyway: if you have a really huge corpus to clean up and need performance, the next step is parallelization: split the source data into parts, send each part to a distinct process (one per processor/core is a good start), possibly distributed over many computers, and collect the results. This pattern is known as "map reduce", and there are already a couple of Python implementations of it.
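
For a single machine, a rough sketch with the standard library's multiprocessing.Pool (the clean_line helper, the chunk size and the file names are just illustrative, not a full map-reduce setup):

from multiprocessing import Pool
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))

def clean_line(line):
    # filter the text part, keep the last field (the sentiment) untouched
    text, _, sentiment = line.rpartition(",")
    kept = " ".join(w for w in text.split() if w.lower() not in STOP)
    return kept + "," + sentiment

if __name__ == "__main__":
    with open("agenericcsvfile.csv", encoding="utf-8") as f:
        lines = f.read().splitlines()

    with Pool() as pool:                            # one worker per core by default
        cleaned = pool.map(clean_line, lines[1:], chunksize=1000)

    with open("agenericcsvfile_filtered.csv", "w", encoding="utf-8") as out:
        out.write("\n".join([lines[0]] + cleaned) + "\n")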

Upvotes: 4
