Reputation: 2726
I'm opening a file named file.dat using pandas.read_csv. file.dat contains several hundred million lines, so its size exceeds my available RAM. The file looks like:
2.069921794968841368e+03 4.998600000000000000e+04
2.069943528235504346e+03 4.998600000000000000e+04
2.070004614137329099e+03 4.998300000000000000e+04
2.070022949424665057e+03 4.998100000000000000e+04
2.070029861936420730e+03 4.998000000000000000e+04
....
....
....
The code snippet to open the file is:
import pandas as pd

file = pd.read_csv("file.dat",
                   delim_whitespace=True, index_col=None,
                   iterator=True, chunksize=1000)
I have a function process which iterates through file and performs an analysis:
def process(file, arg):
    output = []
    for chunk in file:              # iterate through each chunk of the file
        val = evaluate(chunk, arg)  # do something involving chunk and arg
        output.append(val)          # and incorporate this into output
    return output                   # then return the result
This all works fine. However, to do multiple runs of process(file, arg), I have to rerun the file = pd.read_csv snippet. For example, this does not work:
outputs = []
for arg in [arg1, arg2, arg3]:
    outputs.append(process(file, arg))
but this does:
outputs = []
for arg in [arg1, arg2, arg3]:
    file = pd.read_csv("file.dat",
                       delim_whitespace=True, index_col=None,
                       iterator=True, chunksize=1000)
    outputs.append(process(file, arg))
The essential problem is that the iterable produced by pd.read_csv is only usable once. Why is this so? Is this the expected behavior?
Upvotes: 2
Views: 218
Reputation: 1702
This is the expected behavior, because the TextFileReader object that pd.read_csv returns when a chunksize is specified is an iterator, not a reusable iterable.
I admit that the wording around what object you get back is somewhat confusing. Here in the documentation you are told that you get an "iterable object". But if you have a look at the source code in pandas/io/parsers.py, you will find that the TextFileReader object is an iterator, since the class defines a __next__ method.
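You can verify this yourself in an interactive session; a minimal sketch, assuming the same file.dat and read_csv call as in the question:

import pandas as pd

reader = pd.read_csv("file.dat",
                     delim_whitespace=True, index_col=None,
                     iterator=True, chunksize=1000)

print(hasattr(reader, "__next__"))  # True: the object supports next()
print(iter(reader) is reader)       # True: iterating it hands back the object itself,
                                    # so every loop consumes the same underlying state
first_chunk = next(reader)          # pulls the first 1000 rows as a DataFrame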
So, in your case file is an iterator, which is used up after one call of the process function. You can observe a similar effect in this toy example with a numpy array:
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = iter(arr1)

def process(file, arg):
    output = []
    for chunk in file:        # iterate through each chunk of the file
        val = chunk ** arg    # do something involving chunk and arg
        output.append(val)    # and incorporate this into output
    return output             # then return the result

outputs1 = []
for arg in [1, 2, 3]:
    outputs1.append(process(arr1, arg))

outputs2 = []
for arg in [1, 2, 3]:
    outputs2.append(process(arr2, arg))
Then you get:
>>> outputs1
[[1, 2, 3], [1, 4, 9], [1, 8, 27]]
>>> outputs2
[[1, 2, 3], [], []]
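If you simply want to run process several times without pasting the read_csv call into the loop, one option (a sketch built on the question's own setup; make_reader is just a hypothetical helper name) is to wrap the call in a small factory function, so each run gets a fresh iterator that starts at the top of the file:

import pandas as pd

def make_reader():
    # Every call returns a brand-new TextFileReader positioned at the
    # start of file.dat, so each pass re-reads the file from the beginning.
    return pd.read_csv("file.dat",
                       delim_whitespace=True, index_col=None,
                       iterator=True, chunksize=1000)

outputs = []
for arg in [arg1, arg2, arg3]:
    outputs.append(process(make_reader(), arg))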
Upvotes: 3