kevinkayaks

Reputation: 2726

pandas read_csv with chunksize argument produces an iterable which can only be used once?

I'm opening a file named file.dat using pandas.read_csv. file.dat contains several hundred million lines, so its size exceeds my available RAM. The file looks like:

2.069921794968841368e+03 4.998600000000000000e+04
2.069943528235504346e+03 4.998600000000000000e+04
2.070004614137329099e+03 4.998300000000000000e+04
2.070022949424665057e+03 4.998100000000000000e+04
2.070029861936420730e+03 4.998000000000000000e+04
....
.... 
.... 

The code snippet to open the file is:

file = pd.read_csv("file.dat", 
                     delim_whitespace = True, index_col = None,
                     iterator = True, chunksize = 1000)

I have a function process that iterates through file and performs an analysis:

def process(file, arg):
    output = []
    for chunk in file: # iterate through each chunk of the file 
        val = evaluate(chunk, arg) # do something involving chunk and arg
        output.append(val) # and incorporate this into output
    return output # then return the result

This all works fine. However, to do multiple runs of process(file, arg), I have to rerun the file = pd.read_csv snippet each time. For example, this does not work:

outputs = []
for arg in [arg1, arg2, arg3]:
    outputs.append(process(file, arg))

but this does:

outputs = []
for arg in [arg1, arg2, arg3]:
    file = pd.read_csv("file.dat",
                       delim_whitespace=True, index_col=None,
                       iterator=True, chunksize=1000)
    outputs.append(process(file, arg))

The essential problem is that the iterable produced by pd.read_csv is usable only once. Why is this so? Is this the expected behavior?

Upvotes: 2

Views: 218

Answers (1)

apitsch

Reputation: 1702

This is the expected behavior: the TextFileReader object that pd.read_csv returns when you pass the chunksize parameter is an iterator, and like any Python iterator it is exhausted after a single pass. It is not a reusable, list-like iterable that you can traverse repeatedly.

Admittedly, there is some confusing wording around what object is returned. The documentation tells you that you get an "iterable object", but if you look at the source code in pandas.io.parsers you will find that TextFileReader is an iterator, since the class defines a __next__ method.
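
You can verify this yourself with a minimal check (a sketch; it assumes the same file.dat from the question exists on disk):

import collections.abc

import pandas as pd

reader = pd.read_csv("file.dat",
                     delim_whitespace=True, index_col=None,
                     iterator=True, chunksize=1000)

print(isinstance(reader, collections.abc.Iterator))  # True: the class defines __next__
print(iter(reader) is reader)  # True: calling iter() on an iterator returns the iterator itself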

So, in your case, file is an iterator that is used up after a single call to the process function. You can observe a similar effect in this toy example with a numpy array:

import numpy as np


arr1 = np.array([1, 2, 3])
arr2 = iter(arr1)


def process(file, arg):
    output = []
    for chunk in file:  # iterate through each chunk of the file
    val = chunk ** arg  # do something involving chunk and arg
        output.append(val)  # and incorporate this into output
    return output  # then return the result


outputs1 = []
for arg in [1, 2, 3]:
    outputs1.append(process(arr1, arg))

outputs2 = []
for arg in [1, 2, 3]:
    outputs2.append(process(arr2, arg))

Then you get:

>>> outputs1
[[1, 2, 3], [1, 4, 9], [1, 8, 27]]
>>> outputs2
[[1, 2, 3], [], []]
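
Note that arr1 (the array itself) can be iterated over again and again, while arr2 (an iterator over it) is empty after the first pass. If you want to run process several times without repeating the pd.read_csv call inline, one workaround (a sketch reusing the names from the question; process, arg1, arg2, and arg3 are assumed to be defined as above) is to wrap the reader creation in a small factory function and build a fresh iterator for each pass:

import pandas as pd

def fresh_reader():
    # Each call returns a brand-new TextFileReader positioned at the
    # start of the file, so every pass sees all of the chunks.
    return pd.read_csv("file.dat",
                       delim_whitespace=True, index_col=None,
                       iterator=True, chunksize=1000)

outputs = []
for arg in [arg1, arg2, arg3]:
    outputs.append(process(fresh_reader(), arg))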

Upvotes: 3
