Reputation:
I am trying to read a large CSV file and run some code on it. I am using chunksize to do this.
file = "./data.csv"
df = pd.read_csv(file, sep="/", header=0,iterator=True, chunksize=1000000, dtype=str)
print len(df.index)
I get the following error when I run the code:
AttributeError: 'TextFileReader' object has no attribute 'index'
How to resolve this?
Upvotes: 4
Views: 7303
Reputation: 13274
Those errors stem from the fact that your pd.read_csv call, in this case, does not return a DataFrame object. Instead, it returns a TextFileReader object, which is an iterator. This is, essentially, because when you set the iterator parameter to True, what is returned is NOT a DataFrame; it is an iterator of DataFrame objects, each the size of the integer passed to the chunksize parameter (in this case 1000000).
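You can see this by checking the type of what pd.read_csv returns and pulling a single chunk off the iterator; a minimal sketch using the same call as in your question:

import pandas as pd

file = "./data.csv"
reader = pd.read_csv(file, sep="/", header=0, iterator=True, chunksize=1000000, dtype=str)

print(type(reader))         # a TextFileReader, not a DataFrame
first_chunk = next(reader)  # advancing the iterator yields one DataFrame of up to 1000000 rows
print(first_chunk.index)    # each chunk, being a DataFrame, does have an .index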
Specific to your case, you can't just call df.index because, simply, an iterator object does not have an index attribute. This does not mean that you cannot access the DataFrames inside the iterator. It means that you either have to loop through the iterator to access one DataFrame at a time, or you have to concatenate all of those DataFrames into one giant one.
If you just want to work with one DataFrame at a time, then the following is what you would need to do to print the index of each DataFrame:
file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0,iterator=True, chunksize=1000000, dtype=str)
for df in dfs:
print(df.index)
# do something
df.to_csv('output_file.csv', mode='a', index=False)
This will save each DataFrame to an output file named output_file.csv. With the mode parameter set to 'a', every write appends to that file, so nothing gets overwritten.
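One caveat with appending, since to_csv writes the column header on every call by default: the output can end up with repeated header rows. A minimal sketch of one way around that, reusing the hypothetical output_file.csv name from above, is to write the header only for the first chunk:

import pandas as pd

file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0, iterator=True, chunksize=1000000, dtype=str)

# Overwrite and write the header on the first chunk, then append without a header.
for i, df in enumerate(dfs):
    df.to_csv('output_file.csv', mode='w' if i == 0 else 'a',
              header=(i == 0), index=False)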
However, if your goal is to concatenate all of the DataFrames into one giant DataFrame, then the following would perhaps be a better path:
file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0,iterator=True, chunksize=1000000, dtype=str)
giant_df = pd.concat(dfs)
print(giant_df.index)
Since you are already using the iterator parameter here, I would assume that you are concerned about memory. If so, the first strategy is the better one: it takes advantage of the benefits that iterators offer when it comes to memory management for large datasets, because only one chunk has to be in memory at a time.
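For instance, if all you ultimately want is the total row count from your original print(len(df.index)) attempt, you can accumulate it chunk by chunk without ever holding the whole file in memory; a minimal sketch, assuming the same file and separator as in your question:

import pandas as pd

file = "./data.csv"
dfs = pd.read_csv(file, sep="/", header=0, iterator=True, chunksize=1000000, dtype=str)

# Only one chunk (a regular DataFrame) lives in memory at a time.
total_rows = 0
for df in dfs:
    total_rows += len(df.index)
print(total_rows)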
I hope this proves useful.
Upvotes: 5