yearntolearn

Reputation: 1074

Tokenizing a huge quantity of text in python

I have a huge list of text files to tokenize. The following code works for a small dataset, but I am having trouble using the same procedure with a huge one. Here is an example with a small dataset.

In [1]: text = [["It works"], ["This is not good"]]

In [2]: tokens = [(A.lower().replace('.', '').split(' ') for A in L) for L in text]

In [3]: tokens
Out [3]: 
[<generator object <genexpr> at 0x7f67c2a703c0>,
<generator object <genexpr> at 0x7f67c2a70320>]

In [4]: list_tokens = [tokens[i].next() for i in range(len(tokens))]
In [5]: list_tokens
Out [5]:
[['it', 'works'], ['this', 'is', 'not', 'good']]

While everything works well with a small dataset, I run into problems processing a huge list of lists of strings (more than 1,000,000 lists) with the same code. I can still build the generators for the huge dataset as in In [3], but it fails at In [4] (the process is killed in the terminal). I suspect this is simply because the body of text is too big.

I am therefore looking for suggestions on how to improve the procedure so that I end up with a list of lists of strings, as in In [5].

My actual goal, however, is to count the words in each list. For the small dataset above, the result would look like this:

[[0,0,1,0,0,1], [1, 1, 0, 1, 1, 0]] (note: each integer denotes the count of each word)

If I don't have to convert generators to lists to get the desired results (i.e. word counts), that would also be good.

Please let me know if my question is unclear. I would love to clarify as best as I can. Thank you.

Upvotes: 2

Views: 2818

Answers (3)

Jason

Reputation: 894

There are plenty of optimized tokenizers available. I would look at CountVectorizer in sklearn, which is built for counting tokens.
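For example, here is a minimal sketch (not part of the original answer) of how CountVectorizer could be applied to the question's data, assuming scikit-learn is installed; get_feature_names_out is the method name in newer scikit-learn releases:

from sklearn.feature_extraction.text import CountVectorizer

text = [["It works"], ["This is not good"]]
docs = [item[0] for item in text]           # flatten the nested lists to plain strings

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)     # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # vocabulary, e.g. ['good' 'is' 'it' 'not' 'this' 'works']
print(counts.toarray())                     # one row of counts per document

Because fit_transform returns a sparse matrix, the counts stay compact even for a corpus of over a million documents.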

Update September 2019: Use spaCy.
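A minimal tokenization-only sketch with spaCy (my addition, assuming spaCy is installed); spacy.blank("en") gives just the English tokenizer, and nlp.pipe streams documents so memory stays bounded on a large corpus:

import spacy

nlp = spacy.blank("en")   # tokenizer only, no trained model download needed

docs = ["It works", "This is not good"]
tokens = [[tok.lower_ for tok in doc] for doc in nlp.pipe(docs)]
print(tokens)             # [['it', 'works'], ['this', 'is', 'not', 'good']]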

Upvotes: 2

Kalana

Reputation: 6143

You mentioned that it is a huge dataset. Try this:

In [1]: text = [["It works"], ["This is not good"]]

In [2]:

import re

processed_features = []

for sentence in range(0, len(text)):
    # strip punctuation and other non-word characters from the raw string
    processed_feature = re.sub(r'[^\w\s]', ' ', text[sentence][0])

    # convert to lowercase
    processed_feature = processed_feature.lower()

    processed_features.append(processed_feature)

processed_features

#['it works', 'this is not good']

In [3]:

import nltk

tokenizer = nltk.tokenize.TreebankWordTokenizer()
new_contents = []
for processed_feature in processed_features:
    new_content = tokenizer.tokenize(processed_feature)
    new_contents.append(new_content)

new_contents

#[['it', 'works'], ['this', 'is', 'not', 'good']]


Upvotes: 1

beroe

Reputation: 12316

You could create a set of unique words, then loop through and count each of those...

#! /usr/bin/env python

text = [["It works works"], ["It is not good this"]]

SplitList   = [x[0].split(" ") for x in text]
FlattenList = sum(SplitList,[])  # "trick" to flatten a list
UniqueList  = list(set(FlattenList))
CountMatrix = [[x.count(y) for y in UniqueList] for x in SplitList]

print(UniqueList)
print(CountMatrix)

Output is the total list of words, and their counts in each string:

['good', 'this', 'is', 'It', 'not', 'works']
[[0, 0, 0, 1, 0, 2], [1, 1, 1, 1, 1, 0]]

Upvotes: 3
