Reputation: 244
I tried reading a CSV file with size of ~1GB like below
import csv

res = []
with open("my_csv.csv", "r") as f:
    reader = csv.reader(f)
    for row in reader:
        res.append(row)
I thought 1GB was small enough to load into memory as a list, but in fact the code froze and memory usage hit 100%. I had checked that there were a few GB of free memory before I ran the code.
This answer says,
"You are reading all rows into a list, then processing that list. Don't do that."
But I wonder WHY? Why does the list take up so much more memory than the file size?
Is there any method to parse a CSV into a dict without a memory issue?
For example,
CSV
apple,1,2,a
apple,4,5,b
banana,AAA,0,3
kiwi,g1,g2,g3
Dict
{"apple" : [[1, 2, a], [4, 5, b]],
"banana": [[AAA, 0, 3]],
"kiwi" : [[g1, g2, g3]]}
Upvotes: 1
Views: 1703
Reputation: 11188
To answer your second question:
Is there any method to parse a CSV into a dict without a memory issue?
You don't say exactly what a "memory issue" is, but if you're parsing a CSV into a dict in Python, you're going to use more memory than the CSV itself.
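As a rough illustration (exact numbers depend on your Python version and build), every parsed field becomes its own Python string object, each carrying object-header overhead on top of the characters it holds:

import sys

# One line of the question's sample CSV is 12 bytes on disk, but once
# parsed it becomes a list object plus four separate str objects,
# each with its own header.
line = "apple,1,2,a\n"
row = ["apple", "1", "2", "a"]

print(len(line.encode()))                   # bytes this line occupies in the file
print(sys.getsizeof(row))                   # the list object (header + pointers)
print(sum(sys.getsizeof(s) for s in row))   # the four string objects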
I created a script to generate "big" CSVs, then used @Barmar's code to build the result dict while monitoring time and peak memory consumption, and noticed that on average that code used about 10X more memory than the size of the CSV.
Below are my results from processing 3 of those "big" files: one with 100K rows, one with 1M rows, and one with 10M rows. Each of the 3 blocks below shows the file size (from ls -lh) followed by the time and peak memory of the csv-to-dict run (from /usr/bin/time -l):
715M Jan 6 19:44 gen_10000000x10.csv
55.98 real 49.54 user 4.33 sys
7.46G peak memory footprint
---
72M Jan 6 19:47 gen_1000000x10.csv
4.66 real 4.49 user 0.15 sys
753M peak memory footprint
---
7.2M Jan 6 19:44 gen_100000x10.csv
0.35 real 0.32 user 0.02 sys
79M peak memory footprint
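For reference, a generator script along these lines produces files like the ones above (the field contents here are made up, so the exact sizes will differ from my figures):

import csv

# Hypothetical sketch of a "big" CSV generator: writes <rows> rows of
# 10 short string fields to gen_<rows>x10.csv.
def generate(rows, cols=10):
    path = f"gen_{rows}x{cols}.csv"
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for i in range(rows):
            writer.writerow([f"r{i}c{j}" for j in range(cols)])
    return path

if __name__ == "__main__":
    generate(100_000)  # also try 1_000_000 and 10_000_000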
Upvotes: 0
Reputation: 780974
Appending millions of elements to a list in a loop like that can be inefficient, because periodically the list grows beyond its current allocation and has to be copied to a new, larger area of memory. This happens over and over as the list keeps growing, so the copying cost adds up for very large lists.
You might be better off using the list() function, which may be able to do it more efficiently.
with open("my_csv.csv", "r") as f:
reader = csv.reader(f)
res = list(reader)
Even if it still has the same memory issues, it will be faster simply because the loop is in optimized C code rather than interpreted Python.
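If you want to measure the difference on your own machine, a rough comparison along these lines should work (it uses the file name from the question and rereads the whole file for each timing, so try a smaller file first):

import csv
import timeit

def read_with_loop(path):
    res = []
    with open(path, "r", newline="") as f:
        for row in csv.reader(f):
            res.append(row)
    return res

def read_with_list(path):
    with open(path, "r", newline="") as f:
        return list(csv.reader(f))

for fn in (read_with_loop, read_with_list):
    secs = timeit.timeit(lambda: fn("my_csv.csv"), number=1)
    print(f"{fn.__name__}: {secs:.2f}s")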
There's also overhead from all the lists themselves. Internally, a list has some header information, and then pointers to the data for each list element. There can also be excess space allocated to allow for growth without reallocating, but I suspect the csv
module is able to avoid this (it's uncommon to append to lists read from a CSV). This overhead is usually not significant, but if you have many lists and the elements are small, the overhead can come close to doubling the memory required.
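You can see both effects with sys.getsizeof (exact numbers vary between Python versions): a list carries a fixed header plus one pointer per element, and a list grown by repeated append() usually keeps some spare capacity that list() over a sized iterable does not:

import sys

# A list built by repeated append() usually carries spare growth room,
# while list() over an iterable with a known length tends to be sized exactly.
appended = []
for x in range(10):
    appended.append(x)

exact = list(range(10))

print(sys.getsizeof(appended))  # header + pointers + spare capacity
print(sys.getsizeof(exact))     # header + pointers, little or no spare capacity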
For your second question, you should heed the advice in the question you linked to. Process the file one record at a time, adding to the dictionary as you go.
import csv

result = {}
with open("my_csv.csv", "r") as f:
    reader = csv.reader(f)
    for row in reader:
        result.setdefault(row[0], []).append(row[1:])
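Note that csv.reader gives you every field as a string, so for the sample CSV in the question the result looks like this (you'd need to convert the numeric fields yourself if you want ints):

{'apple': [['1', '2', 'a'], ['4', '5', 'b']],
 'banana': [['AAA', '0', '3']],
 'kiwi': [['g1', 'g2', 'g3']]}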
Upvotes: 2