Reputation: 660
From the nltk "How To" guides, I know I can use Python to find the top X bigrams/trigrams in a file using something like this:
>>> import nltk
>>> from nltk.collocations import *
.....
>>> text = inputFile.read()
>>> tokens = nltk.wordpunct_tokenize(text)
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> finder = BigramCollocationFinder.from_words(tokens)
>>> finder.nbest(bigram_measures.pmi, 10)
The problem with this is that I have to load the file into memory, which only works at the moment because I have split the text into multiple smaller chunks. I definitely don't have enough memory to combine all the files into one single file or into one string for searching (the total size is ~25GB). So if I want to find the top X bigrams, I have to do it file by file, but then bigrams get repeated across my outputs, and I also miss bigrams that would rank in the overall top X but don't make the per-file cut in any individual file.
Is there any way to use the nltk library to accomplish this or is it just a limitation I'll have to work around? Or is there another library or method to accomplish this same goal?
Upvotes: 1
Views: 1242
Reputation: 57033
Split your data into N files, such that N is large enough for each single file to be read into RAM and processed in its entirety. N=25 or 50 may be a good choice. For each of these files, find the X most frequent bigrams and combine them into a single list L0. Then take the smallest frequency f0 on that list.
On the second pass, go through all the files again and collect every bigram whose frequency in some file is at least f0/N. By the pigeonhole principle, any bigram whose total count across all files is at least f0 must reach f0/N in at least one file, so no bigram that could make the top X is missed.
Finally, calculate the total frequency of each collected bigram across all files, insert them into L0, and select the top X bigrams.
If bigram frequencies in each file follow Zipf's law, you should be able to extract the top X bigrams with whatever limited RAM you have.
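A minimal sketch of this two-pass scheme, assuming the files already exist on disk and each fits in RAM. The helper `file_bigram_counts`, the use of `nltk.wordpunct_tokenize`, and the choice to reread the files for the final tally (rather than caching per-file counts) are my own assumptions, and it ranks by raw frequency as described above, not by PMI:

```python
from collections import Counter
from pathlib import Path
import nltk

def file_bigram_counts(path):
    # Count bigrams in a single file that fits in RAM.
    text = Path(path).read_text(encoding="utf-8")
    tokens = nltk.wordpunct_tokenize(text)
    return Counter(nltk.bigrams(tokens))

def top_bigrams(paths, x):
    n = len(paths)

    # Pass 1: per-file top-X bigrams, and the smallest frequency f0 among them.
    candidates = set()
    f0 = float("inf")
    for path in paths:
        counts = file_bigram_counts(path)
        for bigram, freq in counts.most_common(x):
            candidates.add(bigram)
            f0 = min(f0, freq)

    # Pass 2: add every bigram whose per-file count is at least f0 / n.
    # Any bigram with total count >= f0 must clear this bar in at least
    # one file, so every potential top-X bigram becomes a candidate.
    threshold = f0 / n
    for path in paths:
        counts = file_bigram_counts(path)
        candidates.update(b for b, freq in counts.items() if freq >= threshold)

    # Final tally: total the candidates across all files and keep the top X.
    totals = Counter()
    for path in paths:
        counts = file_bigram_counts(path)
        totals.update({b: counts[b] for b in candidates if b in counts})
    return totals.most_common(x)
```

Usage would be something like `top_bigrams(sorted(Path("chunks").glob("*.txt")), 100)`. The trade-off is extra I/O (each file is read more than once) in exchange for never holding more than one file's counts plus the candidate set in memory.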
Upvotes: 2