Dictionaries and Big Inputs on Python

I have a big 20 GB text input file that I process. I build an index, which I store in a dict. The problem is that I access this dict for every term in the file, and for every term I may also add it as an item to the dict, so I cannot just write it out to disk. When I reach my maximum RAM capacity (8 GB) the system (Win8 64-bit) starts paging to virtual memory, so I/O is extremely high and the system is unstable (I got a blue screen once). Any idea how I can improve this?


Edit: for example, pseudocode:

input_text = open("C:\\input.txt", 'r').read()  # renamed so it doesn't shadow the input() builtin
text = input_text.split()
temp_dict = {}
for i, word in enumerate(text):
    if word in temp_dict:
        text[i] = something()
    else:
        temp_dict[word] = hash_function()

print(temp_dict, file=...)
print(text, file=...)

Upvotes: 0

Views: 120

Answers (2)

Aaron Hall

Reputation: 394965

Don't read the entire file into memory; do something like this instead:

with open("/input.txt",'rU') as file:
    index_dict = {}
    for line in file:
        for word in line.split():
            index_dict.setdefault(word, []).append(file.tell() + line.find(word))

To break it down: open the file with a context manager, so that if you get an error, it automatically closes the file for you. I also changed the path to work on Unix and added the U flag for universal newline mode.

with open("/input.txt",'rU') as file:

Since semantically, an index is a list of words keyed to their location, I'm changing the dict to index_dict:

    index_dict = {}

Using the file object directly as an iterator prevents you from reading the entire file into memory:

    for line in file:

Then we can split the line and iterate by word:

        for word in line.split():

and using the dict.setdefault method, we put the location of the word into a fresh empty list if the key isn't already there; if it is there, we just append to the list that already exists:

            index_dict.setdefault(word, []).append(file.tell() + line.find(word))

Does that help?

Upvotes: 1

remram

Reputation: 5203

I would recommend simply using a database instead of a dictionary. In its simplest form, a database is a disk-based data structure that is meant to span several gigabytes.

You can have a look at sqlite3 or SQLAlchemy for instance.

Additionally, you probably don't want to load the whole input file in memory at once either.

Upvotes: 0
