Reputation: 1300
First of all, I am aware of the number of Python memory error questions on SO, but so far none of them has matched my use case.
I am currently trying to parse a bunch of text files (~6k files, ~30 GB in total) and store each unique word. Yes, I am building a wordlist; no, I am not planning on doing evil things with it, it is for the university.
I implemented the list of found words as a set (created with words = set([]) and filled with words.add(word)), and I simply add every found word to it, relying on the set mechanics to remove all duplicates. This means that I need permanent access to the whole set for this to work (or at least I see no alternative, since the whole list has to be checked for duplicates on every insert).
Right now, I am running into a MemoryError about 25% of the way through, when the script uses about 3.4 GB of RAM. I am on 32-bit Linux, so I know where that limitation comes from, and my PC only has 4 GB of RAM, so even 64-bit would not help here.
I know that the complexity is probably terrible (probably O(n) on each insert, although I don't know how Python sets are implemented (trees?)), but it is still (probably) faster and (definitely) more memory efficient than adding each word to a primitive list and removing the duplicates afterwards.
Is there any way to get this to run? I expect about 6-10 GB of unique words, so using my current RAM is out of the question, and upgrading my RAM is currently not possible (and does not scale too well once I start letting this script loose on a larger number of files).
My only ideas at the moment are caching on disk (which will slow the process down even more), or writing temporary sets to disk and merging them afterwards, which will take even more time and whose complexity would be horrible indeed. Is there even a solution that will not result in horrible runtimes?
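To make the second idea a bit more concrete, this is roughly what I have in mind: flush the in-memory set to a sorted chunk file whenever it grows past some threshold, then merge the sorted chunks afterwards. Completely untested, and all names here are made up:

import heapq
import itertools

def flush_chunk(words, chunk_files):
    # dump the current in-memory set to a sorted chunk file and clear it
    name = "chunk_%05d.tmp" % len(chunk_files)
    out = open(name, "w")
    for w in sorted(words):
        out.write(w + "\n")
    out.close()
    chunk_files.append(name)
    words.clear()

def merge_chunks(chunk_files, result_name):
    files = [open(name) for name in chunk_files]
    streams = [(line.rstrip("\n") for line in f) for f in files]
    out = open(result_name, "w")
    # heapq.merge streams the sorted chunks with one pending line per chunk,
    # itertools.groupby collapses consecutive duplicates of the merged stream
    for word, _ in itertools.groupby(heapq.merge(*streams)):
        out.write(word + "\n")
    out.close()
    for f in files:
        f.close()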
For the record, this is my full source. As it was written for personal use only, it is pretty horrible, but you get the idea.
import os
import sys

words=set([])
lastperc = 0
current = 1
argl = 0

# first pass: count the .txt files so the progress bar has a total
print "Searching for .txt-Files..."
for _,_,f in os.walk("."):
    for file in f:
        if file.endswith(".txt"):
            argl=argl+1

print "Found " + str(argl) + " Files. Beginning parsing process..."
print "0% 50% 100%"

# second pass: split every line on spaces and add each word to the set
for r,_,f in os.walk("."):
    for file in f:
        if file.endswith(".txt"):
            fobj = open(os.path.join(r,file),"r")
            for line in fobj:
                line = line.strip()
                word, sep, remains = line.partition(" ")
                if word != "":
                    words.add(word)
                word, sep, remains = remains.partition(" ")
                while sep != "":
                    words.add(word)
                    word, sep, remains2 = remains.partition(" ")
                    remains = remains2
                if remains != "":
                    words.add(remains)
            # advance the progress bar by one "=" per finished percent
            newperc = int(float(current)/argl*100)
            if newperc-lastperc > 0:
                for i in range(newperc-lastperc):
                    sys.stdout.write("=")
                    sys.stdout.flush()
                lastperc = newperc
            current = current+1

print ""
print "Done. Set contains " + str(len(words)) + " different words. Sorting..."
sorteddic = sorted(words, key=str.lower)
print "Sorted. Writing to File"
print "0% 50% 100%"

# write the sorted word list to the output file given on the command line
lastperc = 0
current = 1
sdicl = len(sorteddic)-1
fobj = open(sys.argv[1],"w")
for element in sorteddic:
    fobj.write(element+"\n")
    newperc = int(float(current)/sdicl*100)
    if newperc-lastperc > 0:
        for i in range(newperc-lastperc):
            sys.stdout.write("=")
            sys.stdout.flush()
        lastperc = newperc
    current = current+1

print ""
print "Done. Enjoy your wordlist."
Thanks for your help and ideas.
Upvotes: 0
Views: 370
Reputation: 602115
The first thing I'd try would be to restrict words to lower-case characters – as Tyler Eaves pointed out, this will probably reduce the set size enough to fit into memory. Here's some very basic code to do this:
import os
import fnmatch
import re

def find_files(path, pattern):
    # walk the tree and yield the path of every file matching the pattern
    for root, directories, files in os.walk(path):
        for f in fnmatch.filter(files, pattern):
            yield os.path.join(root, f)

words = set()
for file_name in find_files(".", "*.txt"):
    with open(file_name) as f:
        data = f.read()
        words.update(re.findall(r"\w+", data.lower()))
A few more comments:
I would usually expect the dictionary to grow rapidly at the beginning; very few new words should be found late in the process, so your extrapolation might severely overestimate the final size of the word list.
Sets are very efficient for this purpose. They are implemented as hash tables, and adding a new word has an amortised complexity of O(1).
Upvotes: 1
Reputation: 840
Hash your keys into a codespace that is smaller and more manageable. Key the hash to a file containing the keys with that hash. The table of hashes is much smaller, and the individual key files are much smaller.
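For example (untested; the bucket count, file names, and the use of Python's built-in hash are only for illustration, and the resulting list is only sorted within each bucket):

import os

NUM_BUCKETS = 256
buckets = [open("bucket_%03d.tmp" % i, "w") for i in range(NUM_BUCKETS)]

for root, dirs, files in os.walk("."):
    for name in files:
        if name.endswith(".txt"):
            for line in open(os.path.join(root, name)):
                for word in line.split():
                    # the hash decides which (much smaller) file the word goes into
                    buckets[hash(word) % NUM_BUCKETS].write(word + "\n")

for b in buckets:
    b.close()

# each bucket should now be small enough to deduplicate in memory on its own
out = open("wordlist.txt", "w")
for i in range(NUM_BUCKETS):
    for word in sorted(set(line.strip() for line in open("bucket_%03d.tmp" % i))):
        out.write(word + "\n")
out.close()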
Upvotes: 0
Reputation: 3472
My inclination is towards a database table, but if you want to stay within a single framework, check out PyTables: http://www.pytables.org/moin
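For the database-table side of that, even sqlite3 from the standard library would do. A rough, untested sketch (this is not PyTables; the file name, table layout and batching are made up):

import sqlite3

conn = sqlite3.connect("words.db")
conn.text_factory = str   # keep plain byte strings
conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)")

def add_words(word_iter):
    # the PRIMARY KEY makes the database deduplicate on disk
    conn.executemany("INSERT OR IGNORE INTO words VALUES (?)",
                     ((w,) for w in word_iter))
    conn.commit()

# feed it words file by file, e.g. add_words(line.split()) for every line,
# then read the result back sorted:
for (word,) in conn.execute("SELECT word FROM words ORDER BY word"):
    print word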
Upvotes: 1
Reputation: 13131
Do you really mean 6-10GB of unique words? Is this English text? Surely even counting proper nouns and names there shouldn't be more than a few million unique words.
Anyway, what I would do is process one file at a time, or even one section (say, 100k) of a file at a time, and build a unique wordlist just for that portion. Then just union all the sets as a post-processing step.
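One way to read that, as a very rough and untested sketch (taking the 100k as a word count; the temporary file names are arbitrary):

import os

section_files = []

def dump_section(section):
    # write one section's unique words to its own temporary file
    name = "section_%d.tmp" % len(section_files)
    out = open(name, "w")
    for w in section:
        out.write(w + "\n")
    out.close()
    section_files.append(name)

for root, dirs, files in os.walk("."):
    for name in files:
        if name.endswith(".txt"):
            section = set()
            for line in open(os.path.join(root, name)):
                section.update(line.split())
                if len(section) >= 100000:
                    dump_section(section)
                    section = set()
            if section:
                dump_section(section)

# post-processing step: union all the per-section sets
unique_words = set()
for name in section_files:
    unique_words.update(line.strip() for line in open(name))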
Upvotes: 2
Reputation: 599788
You're probably going to need to store the keys on disk. A key-value store like Redis might fit the bill.
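A minimal, untested sketch with the redis-py client; it assumes a Redis server on localhost, an arbitrary key name and batch size, and a redis-py version whose sadd accepts multiple members per call:

import os
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

batch = []
for root, dirs, files in os.walk("."):
    for name in files:
        if name.endswith(".txt"):
            for line in open(os.path.join(root, name)):
                batch.extend(line.split())
                if len(batch) >= 10000:
                    r.sadd("words", *batch)   # the Redis set does the deduplication
                    batch = []
if batch:
    r.sadd("words", *batch)

print "unique words:", r.scard("words")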
Upvotes: 3