malexmave

Reputation: 1300

Python Memory error solutions if permanent access is required

First of all: I am aware of the number of Python memory error questions on SO, but so far none has matched my use case.

I am currently trying to parse a bunch of text files (~6k files, ~30 GB in total) and store each unique word. Yes, I am building a wordlist; no, I am not planning on doing evil things with it, it is for university.

I implemented the list of found words as a set (created with words = set([]) and filled with words.add(word)), and I just add every word found to it, relying on the set semantics to remove all duplicates.

This means that I need permanent access to the whole set for this to work (or at least I see no alternative, since the whole set has to be checked for duplicates on every insert).

Right now I am running into a MemoryError about 25% of the way through, when the process uses about 3.4 GB of my RAM. I am on 32-bit Linux, so I know where that limitation comes from, and my PC only has 4 GB of RAM, so even 64-bit would not help much here.

I know that the complexity is probably terrible (probably O(n) on each insert, although I don't know how Python sets are implemented (trees?)), but it is still (probably) faster and (definitely) more memory-efficient than adding each word to a primitive list and removing duplicates afterwards.

Is there any way to get this to run? I expect about 6-10 GB of unique words, so using my current RAM is out of the question, and upgrading my RAM is currently not possible (and does not scale too well once I start letting this script loose on larger numbers of files).

My only ideas at the moment are caching on disk (which will slow the process down even more) or writing temporary sets to disk and merging them afterwards, which will take even more time, and the complexity would be horrible indeed. Is there any solution that will not result in horrible runtimes?
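
Something like this is what I have in mind for the second idea (untested sketch; it assumes every temporary set has already been written to its own file as a sorted list of words, one per line, each line ending in a newline):

import heapq
import itertools

def merge_sorted_wordfiles(part_names, out_name):
    # heapq.merge lazily k-way-merges the already-sorted line streams,
    # so only one line per part file is in memory at any time.
    parts = [open(name) for name in part_names]
    out = open(out_name, "w")
    for line, _ in itertools.groupby(heapq.merge(*parts)):
        out.write(line)  # groupby collapses runs of identical lines
    out.close()
    for f in parts:
        f.close()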

For the record, this is my full source. As it was written for personal use only, it is pretty horrible, but you get the idea.

import os
import sys

words = set([])
lastperc = 0
current = 1
argl = 0

# First pass: count the .txt files so the progress bar can be scaled.
print "Searching for .txt-Files..."
for _, _, f in os.walk("."):
    for file in f:
        if file.endswith(".txt"):
            argl = argl + 1
print "Found " + str(argl) + " Files. Beginning parsing process..."
print "0%                                              50%                                             100%"

# Second pass: split every line on spaces and collect the words in the set.
for r, _, f in os.walk("."):
    for file in f:
        if file.endswith(".txt"):
            fobj = open(os.path.join(r, file), "r")
            for line in fobj:
                line = line.strip()
                word, sep, remains = line.partition(" ")
                if word != "":
                    words.add(word)
                word, sep, remains = remains.partition(" ")
                while sep != "":
                    if word != "":  # guard against runs of spaces
                        words.add(word)
                    word, sep, remains = remains.partition(" ")
                if word != "":  # after the loop the last token sits in word
                    words.add(word)
            fobj.close()
            # Extend the progress bar by however many percent we advanced.
            newperc = int(float(current) / argl * 100)
            if newperc - lastperc > 0:
                for i in range(newperc - lastperc):
                    sys.stdout.write("=")
                    sys.stdout.flush()
            lastperc = newperc
            current = current + 1
print ""
print "Done. Set contains " + str(len(words)) + " different words. Sorting..."
sorteddic = sorted(words, key=str.lower)
print "Sorted. Writing to File"
print "0%                                              50%                                             100%"
lastperc = 0
current = 1
sdicl = len(sorteddic)  # scale so the bar ends exactly at 100%
fobj = open(sys.argv[1], "w")
for element in sorteddic:
    fobj.write(element + "\n")
    newperc = int(float(current) / sdicl * 100)
    if newperc - lastperc > 0:
        for i in range(newperc - lastperc):
            sys.stdout.write("=")
            sys.stdout.flush()
    lastperc = newperc
    current = current + 1
fobj.close()
print ""
print "Done. Enjoy your wordlist."

Thanks for your help and ideas.

Upvotes: 0

Views: 370

Answers (5)

Sven Marnach

Reputation: 602115

The first thing I'd try would be to restrict words to lower-case characters – as Tyler Eaves pointed out, this will probably reduce the set size enough to fit into memory. Here's some very basic code to do this:

import os
import fnmatch
import re

def find_files(path, pattern):
    for root, directories, files in os.walk(path):  # os.walk yields (dirpath, dirnames, filenames)
        for f in fnmatch.filter(files, pattern):
            yield os.path.join(root, f)

words = set()
for file_name in find_files(".", "*.txt"):
    with open(file_name) as f:
        data = f.read()
    words.update(re.findall(r"\w+", data.lower()))

A few more comments:

  1. I would usually expect the dictionary to grow rapidly at the beginning; very few new words should be found late in the process, so your extrapolation might severely overestimate the final size of the word list.

  2. Sets are very efficient for this purpose. They are implemented as hash tables, and adding a new word has an amortised complexity of O(1).

Upvotes: 1

starbolin

Reputation: 840

Hash your keys into a codespace that is smaller and more manageable. Key the hash to a file containing the keys with that hash. The table of hashes is much smaller, and the individual key files are much smaller.
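
An untested sketch of the idea (the bucket count, file names, and the stream_of_words() source are all made up; pick the bucket count so that each bucket fits comfortably in memory):

NUM_BUCKETS = 256  # arbitrary; keep it below the OS open-file limit

# Phase 1: append every word to the file for its hash bucket.
out_files = [open("bucket-%03d.txt" % i, "w") for i in range(NUM_BUCKETS)]
for word in stream_of_words():  # hypothetical: however you produce the words
    out_files[hash(word) % NUM_BUCKETS].write(word + "\n")
for f in out_files:
    f.close()

# Phase 2: all occurrences of a key land in the same bucket,
# so each bucket can be deduplicated on its own with a small set.
for i in range(NUM_BUCKETS):
    with open("bucket-%03d.txt" % i) as f:
        unique = set(line.strip() for line in f)
    # ... append the words in `unique` to the final wordlist here ...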

Upvotes: 0

Karmel

Reputation: 3472

My inclination is towards a database table, but if you want to stay within a single framework, check out PyTables: http://www.pytables.org/moin
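
If a plain database table is enough, here is a minimal sketch using sqlite3 from the standard library (not PyTables; stream_of_words() is a stand-in for your parser):

import sqlite3

conn = sqlite3.connect("words.db")
conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)")
# The PRIMARY KEY index deduplicates on disk instead of in RAM.
for word in stream_of_words():  # hypothetical word source
    conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
conn.commit()
for (word,) in conn.execute("SELECT word FROM words ORDER BY word"):
    print word
conn.close()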

Upvotes: 1

Tyler Eaves

Reputation: 13131

Do you really mean 6-10 GB of unique words? Is this English text? Surely, even counting proper nouns and names, there shouldn't be more than a few million unique words.

Anyway, what I would do is process one file at a time, or even one section (say, 100k) of a file at a time, and build a unique wordlist just for that portion. Then just union all the sets as a post-processing step.
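
An untested sketch of that approach (file names and the chunking are up to you; the final union still has to fit in memory, which should be fine if the unique-word count is in the millions rather than the billions):

import pickle

def dump_chunk(chunk_text, part_path):
    # One small set per file/section, written to disk and forgotten.
    with open(part_path, "wb") as out:
        pickle.dump(set(chunk_text.split()), out)

def union_parts(part_paths):
    merged = set()
    for path in part_paths:  # post-processing: fold in one part at a time
        with open(path, "rb") as f:
            merged |= pickle.load(f)
    return merged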

Upvotes: 2

Daniel Roseman

Reputation: 599788

You're probably going to need to store the keys on disk. A key-value store like Redis might fit the bill.
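
A minimal sketch with the third-party redis-py client, assuming a Redis server on localhost (stream_of_words() is a stand-in for your parser):

import redis  # third-party package; needs a running Redis server

r = redis.Redis(host="localhost", port=6379)
for word in stream_of_words():  # hypothetical word source
    r.sadd("words", word)  # a Redis set deduplicates server-side, outside your process
print r.scard("words")  # number of unique words stored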

Upvotes: 3
