Reputation: 123
I have a file of size 500MB. If I store each line of that file in a dictionary set up like this:
file = "my_file.csv"
with open(file) as f:
for l in f:
delimiter = ','
line = l.split(delimiter)
hash_key = delimiter.join(line[:4])
store_line = delimiter.join(line[4:])
store_dict[hash_key] = store_line
To check my memory usage, I watched htop while running my program, first with the code above, then with the last line switched to
print(hash_key + ":" + store_line)
and that version took < 100MB of memory. With the dictionary, my store_dict ends up at approximately 1.5GB in memory. I have checked for memory leaks and can't find any. Removing the line store_dict[hash_key] = store_line results in the program taking < 100MB of memory. Why does the dictionary take up so much memory? Is there any way to store the lines in a dictionary without it taking up so much space?
Upvotes: 0
Views: 519
Reputation: 13100
Even if the store_line strs each took up the same amount of memory as the corresponding piece of text in the file on disk (which they probably don't, especially if you are using Python 3, where strs default to Unicode), the dict necessarily takes up way more space than your file. The dict does not contain only the bare text, but a lot of Python objects.
Each dict key and value is a str, and each of those carries not just the text but also its own length, hash, and reference count used for garbage collection. The dict itself also needs to store metadata about its items, such as the hash of each key and a pointer to each value.
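For a rough sense of that per-object overhead, here is a minimal sketch using sys.getsizeof on strings shaped like the ones your loop builds (the sample values are made up, and the exact byte counts vary by Python version and platform):

import sys

# Hypothetical sample data resembling one parsed CSV line.
hash_key = "a,b,c,d"
store_line = "e,f,g,h,i,j\n"

# Every str object carries a header (reference count, type pointer, length,
# cached hash, flags) on top of the character data, so even tiny strings
# cost tens of bytes each.
print(sys.getsizeof(hash_key))    # e.g. ~56 bytes on 64-bit CPython 3.x
print(sys.getsizeof(store_line))  # e.g. ~61 bytes

# The dict adds its own hash-table storage (key hashes plus key and value
# pointers) on top of the objects it references; getsizeof reports only
# the table, not the strings it points to.
d = {hash_key: store_line}
print(sys.getsizeof(d))           # e.g. ~180-230 bytes for a one-entry dict

Multiply that fixed per-string and per-slot cost by millions of short lines and the 500MB of raw text easily grows into the 1.5GB you are seeing.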
If you had a few very long lines in the file, then you should expect the Python representation to have comparable memory consumption. That is, if you are sure that the file uses the same encoding as Python...
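To illustrate why line length matters, here is a small sketch (again with made-up data) comparing the same 10 million characters stored as one long str versus a million short ones; the numbers are approximate and implementation-dependent:

import sys

# One long 10 MB str: a single fixed-size header, negligible next to the payload.
one_big = "x" * 10_000_000
print(sys.getsizeof(one_big))                     # ~10,000,049 bytes

# The same amount of text as 1,000,000 distinct 10-character strings:
# each piece pays its own ~49-byte header, nearly sextupling the total.
many_small = [str(i).zfill(10) for i in range(1_000_000)]
print(sum(sys.getsizeof(s) for s in many_small))  # ~59,000,000 bytes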
Upvotes: 2