Matek

Reputation: 711

Pandas - is it possible to "rewind" read_csv with the chunksize= argument?

I am dealing with a big dataset, so to read it with pandas I use read_csv with the chunksize= option:

data = pd.read_csv("dataset.csv", chunksize=2e5)

Then I operate on the resulting chunks in the following way:

any_na_cols = [chunk.do_something() for chunk in data]

The problem is that when I want to do something else in the same way, I get an empty result, because I have already iterated over all the chunks. So I would have to call data = pd.read_csv("dataset.csv", chunksize=2e5) again before performing the next operation.

Most likely there is nothing wrong with that, but this approach feels inelegant to me. Isn't there a method like data.rewind() or something similar that would let me iterate through the chunks again? I could not find anything like that in the documentation. Or am I committing some design mistake with this approach?
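
To make the workaround described above concrete, here is a minimal sketch that wraps the read_csv call in a small helper so that each pass gets a fresh reader. The helper name iter_chunks and the in-memory sample data are hypothetical stand-ins for the real file, and every call re-reads the source from the start, so a second pass costs a second round of I/O.

```python
import io
import pandas as pd

# Hypothetical stand-in for "dataset.csv" so the sketch runs on its own.
CSV_TEXT = "a,b\n1,2\n3,\n5,6\n7,8\n9,10\n"

def iter_chunks(chunksize=2):
    """Return a fresh chunk iterator; each call re-reads the data from the start."""
    return pd.read_csv(io.StringIO(CSV_TEXT), chunksize=chunksize)

# First pass: number of rows in each chunk.
rows_per_chunk = [len(chunk) for chunk in iter_chunks()]

# Second pass: a new reader, so the chunks are available again.
na_per_chunk = [int(chunk.isna().sum().sum()) for chunk in iter_chunks()]

print(rows_per_chunk)
print(na_per_chunk)
```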

Upvotes: 2

Views: 613

Answers (1)

MaxU - stand with Ukraine

Reputation: 210882

I don't think it's a good idea to read your CSV again, because that doubles the amount of I/O. It's better to "do something else" during the same iteration:

import pandas as pd

any_na_cols = pd.DataFrame()

# Read the file once and run every per-chunk operation inside the same loop.
for chunk in pd.read_csv("dataset.csv", chunksize=2e5):
    any_na_cols = pd.concat([any_na_cols, chunk.do_something()], ignore_index=True)
    # do something else with the same chunk here
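
A minimal runnable sketch of the single-pass idea, assuming two concrete operations in place of do_something(): counting missing values per column and summing each column. The in-memory sample data and the variable names are hypothetical. The per-chunk results are collected in lists and combined once after the loop, which avoids copying the accumulated frame on every iteration.

```python
import io
import pandas as pd

# Hypothetical stand-in for "dataset.csv" so the sketch runs on its own.
CSV_TEXT = "a,b\n1,2\n3,\n5,6\n,8\n9,10\n"

na_counts = []  # first operation: missing values per column, per chunk
col_sums = []   # second operation: column sums, per chunk

# One pass over the data; both operations see every chunk.
for chunk in pd.read_csv(io.StringIO(CSV_TEXT), chunksize=2):
    na_counts.append(chunk.isna().sum())
    col_sums.append(chunk.sum())

# Combine the per-chunk partial results once, after the loop.
total_na = pd.concat(na_counts, axis=1).sum(axis=1)
total_sum = pd.concat(col_sums, axis=1).sum(axis=1)

print(total_na)
print(total_sum)
```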

Upvotes: 2
