Reputation: 711
I am dealing with a big dataset, so to read it in pandas I use read_csv with the chunksize=
option:
data = pd.read_csv("dataset.csv", chunksize=200000)
Then I operate on the chunks in the following way:
any_na_cols = [chunk.do_something() for chunk in data]
The problem is that when I want to do something else in the same way as above, I get an empty result, because I have already iterated over the chunks. I would therefore have to call data = pd.read_csv("dataset.csv", chunksize=200000)
again to perform the next operation.
Most likely there is no problem with that, but for some reason this approach feels inelegant to me. Isn't there a method like data.rewind()
or something similar that would let me iterate through the chunks again? I could not find anything like that in the documentation. Or am I committing a design mistake with this approach?
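For context, the object returned by read_csv with chunksize= is a one-shot iterator (a TextFileReader), so a second pass over it yields nothing. A minimal sketch of the behaviour and the re-open workaround, using an in-memory CSV as a stand-in for dataset.csv:

```python
import io
import pandas as pd

CSV_TEXT = "a,b\n1,2\n3,4\n5,6\n"

def chunks():
    # Build a fresh reader each call; the reader itself cannot be rewound.
    return pd.read_csv(io.StringIO(CSV_TEXT), chunksize=2)

first_pass = [len(c) for c in chunks()]   # chunk sizes on the first pass
second_pass = [len(c) for c in chunks()]  # works because we re-created the reader

print(first_pass, second_pass)  # [2, 1] [2, 1]
```

Wrapping the reader construction in a small function like chunks() above at least hides the repetition, but each call still re-reads the file from disk.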
Upvotes: 2
Views: 613
Reputation: 210882
I don't think it's a good idea to read your CSV again - you would double the I/O. It's better to "do something else" during the same iteration:
any_na_cols = pd.DataFrame()
for chunk in pd.read_csv("dataset.csv", chunksize=200000):
    any_na_cols = pd.concat([any_na_cols, chunk.do_something()], ignore_index=True)
    # do something else with the same chunk here
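As a variation on the same single-pass idea: calling pd.concat inside the loop copies the accumulated frame on every iteration, so a common pattern is to collect per-chunk results in a list and concatenate once at the end. A sketch, with chunk.isna().any() standing in for your do_something() and an in-memory CSV standing in for dataset.csv:

```python
import io
import pandas as pd

CSV_TEXT = "a,b\n1,2\n3,4\n5,6\n"

pieces = []
total_a = 0
for chunk in pd.read_csv(io.StringIO(CSV_TEXT), chunksize=2):
    pieces.append(chunk.isna().any())  # stand-in for chunk.do_something()
    total_a += chunk["a"].sum()        # "do something else" in the same pass

result = pd.concat(pieces, ignore_index=True)
print(len(result), total_a)
```

Both computations finish in a single read of the file, which is the point of the answer above.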
Upvotes: 2