Vitaliy

Reputation: 137

Pandas read_csv strange behaviour

Please help me understand the reason for the following read_csv behaviour. I am trying to read a huge file in chunks:

import pandas as pd

c = 1
for chunk in pd.read_csv(filename, chunksize=chunksize):
    print 'chunk ', str(c), ' started'
    # ...data normalization...
    # ...saving the transformed data to file...
    c += 1

I get an error like this:

sys:1: DtypeWarning: Columns (...) have mixed types. Specify dtype option on import or set low_memory=False.
chunk  19  started
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for -: 'str' and 'float'

From the error I can see that, for some reason, at chunk 19 pandas interpreted the float data as strings and cannot perform the '-' operation.

However, if I skip the first 18 chunks and start from chunk 19, it works fine. Intuition says it might be some memory problem, but I would like to understand the reason.

Upvotes: 0

Views: 355

Answers (1)

Batman

Reputation: 8917

It's not a memory problem.

Pandas makes guesses about what the data types should be if you don't specify the dtype argument. Sometimes it realises it's made a mistake and will convert the data type of a column on the fly, if it thinks that's the correct thing to do. In this case, it appears to be guessing that the correct type is a numerical one, then later on encountering some data that makes it think the column should really be strings, so it converts. Does the data have anything like 'N/A' in it, by any chance?
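Here's a minimal sketch (with made-up data) of how a column can flip dtype between chunks when a stray string shows up partway through the file. Note the stray value has to be one pandas doesn't treat as NA by default ('N/A' itself would be parsed as NaN):

import io
import pandas as pd

# Made-up data: the column looks numeric at first, then a stray
# string appears further down the file.
data = io.StringIO("value\n1.0\n2.0\n3.0\nmissing\n")

for i, chunk in enumerate(pd.read_csv(data, chunksize=2)):
    # First chunk: float64. Second chunk: object, so
    # chunk['value'] - 1.0 would raise a TypeError like the one above.
    print("chunk %d dtype: %s" % (i, chunk['value'].dtype))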

Just specify the dtype argument. It will make read_csv faster and more memory-efficient, and you'll either fix the problem or get a better idea of what's causing it.
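A minimal sketch of what that could look like; the column names here ('price', 'product_id') are hypothetical stand-ins for whatever is actually in your file:

import numpy as np
import pandas as pd

# Hypothetical column names and types; substitute the real ones.
dtypes = {'price': np.float64, 'product_id': str}

for chunk in pd.read_csv(filename, dtype=dtypes, chunksize=chunksize):
    # Every chunk now arrives with the same known dtypes, so a stray
    # string in a numeric column fails loudly at read time instead of
    # silently flipping the column to object.
    process(chunk)  # placeholder for your normalization/saving steps

If the numeric columns use some placeholder string for missing values, you can also pass it via na_values (e.g. na_values=['missing']) so it's parsed as NaN instead of breaking the dtype.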

Upvotes: 1
