Reputation: 137
Please help me understand the reason for the following read_csv behaviour. I am trying to read a huge file in chunks:
import pandas as pd

c = 1
for chunk in pd.read_csv(filename, chunksize=chunksize):
    print('chunk', c, 'started')
    # ...data normalization...
    # ...saving the transformed data to file...
    c += 1
I get an error like this:
sys:1: DtypeWarning: Columns (...) have mixed types. Specify dtype option on import or set low_memory=False.
chunk 19 started
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for -: 'str' and 'float'
From the error I can see that, for some reason, at chunk 19 pandas interpreted the float data as strings and cannot perform the '-' operation.
However, if I skip the first 18 chunks and start reading from chunk 19, it goes well. Intuition says it might be some memory problem, but I would like to understand the actual reason.
Upvotes: 0
Views: 355
Reputation: 8917
It's not a memory problem.
Pandas makes guesses about what the data types should be if you don't specify the dtype argument. When you pass chunksize, each chunk is parsed and type-inferred independently, which is why the dtype can suddenly change at chunk 19 even though the first 18 chunks read cleanly. Sometimes pandas realises that it has made a mistake and will convert the data type of a column on the fly, if it thinks that's the correct thing to do. In this case, it appears to be guessing that the correct type is a numerical one, then later encountering some data that makes it think the column should really contain strings, and converting. Does the data have anything like 'N/A' in it, by any chance?
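A minimal, self-contained sketch of that behaviour (synthetic data, not your file): with chunksize, each chunk is type-inferred on its own, so the same column can come back with different dtypes:

import io
import pandas as pd

# Five clean float rows, then one row of text, so the inferred
# dtype differs between the first and second chunk.
csv_data = 'x\n' + '\n'.join(['1.0'] * 5 + ['oops'])

for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=5):
    print(chunk['x'].dtype)
# Prints float64 for the first chunk, then object for the second:
# the same column changes type because each chunk is parsed alone.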
Just specify the dtype argument. It will make read_csv faster and more efficient, and you'll either fix the problem or get a better idea of what's causing it.
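A minimal sketch of that fix, reusing your filename and chunksize and assuming, hypothetically, that the file has columns 'id' and 'value'; substitute your real column names and types:

import pandas as pd

# Hypothetical schema -- replace with your real column names/types.
dtypes = {'id': str, 'value': float}

for c, chunk in enumerate(pd.read_csv(filename,
                                      chunksize=chunksize,
                                      dtype=dtypes,
                                      na_values=['N/A']),
                          start=1):
    print('chunk', c, 'started')
    # 'value' is now float64 in every chunk, so subtraction can no
    # longer hit the str/float TypeError. If some value cannot be
    # parsed as a float, read_csv raises at once and shows you the
    # offending data instead of silently switching to strings.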
Upvotes: 1