user7779326

How to solve an error due to chunksize in pandas?

I am trying to read a large CSV file and process it, and I am using chunksize to do so.

file = "./data.csv"
df = pd.read_csv(file, sep="/", header=0,iterator=True, chunksize=1000000, dtype=str)
print len(df.index)

I get the following error in the code:

AttributeError: 'TextFileReader' object has no attribute 'index'

How to resolve this?

Upvotes: 4

Views: 7303

Answers (1)

Abdou

Reputation: 13274

That error stems from the fact that your pd.read_csv call does not return a DataFrame object here. It returns a TextFileReader object, which is an iterator. When you set the iterator parameter to True (or pass a chunksize), what comes back is NOT a DataFrame; it is an iterator of DataFrame objects, each holding at most the number of rows passed to the chunksize parameter (in this case 1000000). You can't call df.index because an iterator object simply does not have an index attribute.

This does not mean that you cannot access the DataFrames inside the iterator. It means you either have to loop through the iterator and work with one DataFrame at a time, or concatenate all those DataFrames into one giant one.
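To see the difference concretely, here is a minimal sketch (assuming the same data.csv and separator as in your question) that checks the returned type and pulls a single chunk out with get_chunk:

import pandas as pd

reader = pd.read_csv("./data.csv", sep="/", header=0, iterator=True,
                     chunksize=1000000, dtype=str)

print(type(reader))                # a TextFileReader, not a DataFrame

first_chunk = reader.get_chunk()   # pulls the next DataFrame out of the iterator
print(len(first_chunk.index))      # .index works here, because this is a DataFrame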

If you just want to work with one DataFrame at a time, here is what you would do to print the index of each DataFrame:

file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0,iterator=True, chunksize=1000000, dtype=str)

for df in dfs:
    print(df.index)
    # do something
    df.to_csv('output_file.csv', mode='a', index=False)

This will save each chunk into an output file named output_file.csv. With the mode parameter set to a, each write appends to the file, so nothing is overwritten; the header=(i == 0) argument makes sure the column headers are written only once.
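And if what you were really after with len(df.index) is the total number of rows, you can accumulate it chunk by chunk without ever holding the whole file in memory; a small sketch reusing the same read_csv arguments:

# running total of rows across all chunks; no chunk is kept after it is counted
total_rows = 0
for df in pd.read_csv(file, sep="/", header=0, iterator=True,
                      chunksize=1000000, dtype=str):
    total_rows += len(df.index)

print(total_rows)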

However, if your goal is to concatenate all the DataFrames into one giant DataFrame, then the following would perhaps be a better path:

file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0,iterator=True, chunksize=1000000, dtype=str)

giant_df = pd.concat(dfs)

print(giant_df.index)

Since you are already using the iterator parameter, I assume you are concerned about memory. If so, the first strategy is the better one: looping over the chunks takes advantage of what iterators offer for memory management with large datasets, whereas pd.concat loads everything back into memory at once.
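As an illustration of that trade-off, here is a hedged sketch of a chunk-wise aggregation that never materializes the giant DataFrame; the column name 'category' is purely hypothetical and would need to be replaced by a column that actually exists in your file:

from collections import Counter

counts = Counter()
for df in pd.read_csv(file, sep="/", header=0, iterator=True,
                      chunksize=1000000, dtype=str):
    # 'category' is a placeholder column name; use one from data.csv
    counts.update(df['category'].value_counts().to_dict())

print(counts.most_common(10))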

I hope this proves useful.

Upvotes: 5
