jo2083248

Reputation: 123

Why does a file stored as a dictionary take up much more space than the file itself?

I have a file of size 500MB. If I store each line of that file in a dictionary, set up like this:

file = "my_file.csv"
delimiter = ','
store_dict = {}
with open(file) as f:
    for l in f:
        line = l.split(delimiter)
        hash_key = delimiter.join(line[:4])
        store_line = delimiter.join(line[4:])
        store_dict[hash_key] = store_line

To check, I compared the memory usage of my program in htop: first with the code above, then with the last line switched to

print(hash_key + ":" + store_line) 

And that took < 100MB of memory.

The size of my store_dict is approximately 1.5GB in memory. I have checked for memory leaks and can't find any. Removing the line store_dict[hash_key] = store_line results in the program taking < 100MB of memory. Why does the dictionary take up so much memory? Is there any way to store the lines in a dictionary without it taking up so much space?

Upvotes: 0

Views: 519

Answers (1)

jmd_dk

Reputation: 13100

Even if the store_line strs each took up the same amount of memory as the corresponding piece of text in the file on disk (which they probably don't, especially in Python 3, where strs default to Unicode), the dict necessarily takes up far more space than your file. The dict does not contain just the bare text, but a lot of Python objects.

Each dict key and value is a str, and each str carries not just its text but also its own length and a reference count used for garbage collection. The dict itself also needs to store metadata about its items, such as the hash of each key and pointers to each key and value.
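You can see this overhead directly with sys.getsizeof (a rough sketch; the exact byte counts vary between CPython versions and builds):

```python
import sys

# A short str occupies far more memory than its character count:
# the object header, length, hash cache, etc. come on top of the text.
text = "a,b,c,d"
print(len(text), sys.getsizeof(text))

# A dict's own table (key hashes plus key/value pointers) is measured
# separately from the strs it refers to.
d = {str(i): str(i) for i in range(1000)}
print(sys.getsizeof(d))  # size of the dict structure alone
```

Note that sys.getsizeof(d) does not include the memory of the keys and values themselves, so the total footprint is larger still.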

If the file instead had a few very long lines, you should expect the Python representation to have memory consumption comparable to the file size, since the fixed per-object overhead would be amortized over much more text. That is, if you are sure that the file uses the same encoding as Python...
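One way to cut the per-entry cost is to keep keys and values as bytes rather than decoding them to str. A minimal sketch, using an in-memory sample standing in for my_file.csv and the same 4-column key split as the question:

```python
import io

# Hypothetical sample rows in place of opening my_file.csv in binary mode.
sample = io.BytesIO(b"a,b,c,d,x,y\n1,2,3,4,5,6\n")

store_dict = {}
for raw in sample:
    parts = raw.rstrip(b"\n").split(b",")
    # bytes objects skip Unicode decoding, so each entry stays closer
    # to the on-disk size (plus the unavoidable object overhead).
    store_dict[b",".join(parts[:4])] = b",".join(parts[4:])

print(store_dict[b"a,b,c,d"])
```

This avoids the Unicode representation but not the per-object and dict-table overhead, so the mapping will still be noticeably larger than the raw file.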

Upvotes: 2
