lessthanl0l

Reputation: 1095

pandas read_csv import gives mixed type for a column

I have a csv file that contains 130,000 rows. After reading in the file with pandas' read_csv function, one of the columns ("CallGuid") contains mixed object types.

I did:

import pandas as pd

df = pd.read_csv("data.csv")

Then I have this:

In [10]: df["CallGuid"][32767]
Out[10]: 4129237051L    

In [11]: df["CallGuid"][32768]
Out[11]: u'4129259051'

All values up to index 32767 are of type long, and all values from index 32768 onward are unicode.
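Tallying the Python types actually stored in the column confirms the split (a quick sketch, using the column and file names from above):

import pandas as pd

df = pd.read_csv("data.csv")

# Count how many values of each Python type the column holds;
# a cleanly parsed integer column would show a single type here
print(df["CallGuid"].map(type).value_counts())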

Why is this?

Upvotes: 15

Views: 10647

Answers (2)

WNG

Reputation: 3805

OK, I just experienced the same problem, with the same symptom: df[column][n] changed type after n > 32767.

I did indeed have a problem in my data, but not anywhere near line 32767. (The switch likely shows up at such a boundary because, with the default low_memory=True, read_csv parses the file in internal chunks and infers each chunk's dtype separately; one bad value anywhere leaves some chunks typed as object, so the apparent type change lands at a chunk boundary rather than at the offending line.)

Finding and modifying those few problematic lines solved my problem. I managed to locate them with the following quick-and-dirty routine:

import pandas as pd

# Read the file in chunks and print the dtype pandas infers for each one;
# chunks that come back as "object" contain at least one non-integer value
reader = pd.read_csv('data.csv', chunksize=10000)
for i, chunk in enumerate(reader):
    print("{} {}".format(i, chunk["Custom Dimension 02"].dtype))

I ran this and I obtained :

0 int64
1 int64
2 int64
3 int64
4 int64
5 int64
6 object
7 int64
8 object
9 int64
10 int64

This told me that there was at least one problematic line between rows 60000 and 69999 (chunk 6) and at least one between rows 80000 and 89999 (chunk 8).

To pinpoint them more precisely, you can simply use a smaller chunksize and print only the numbers of the chunks that do not have the correct data type.
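A quicker alternative (my sketch, not part of the original routine): pd.to_numeric with errors='coerce' flags the exact offending rows in one pass, assuming the column should be entirely numeric:

import pandas as pd

df = pd.read_csv("data.csv")

# Values that cannot be parsed as numbers become NaN here
coerced = pd.to_numeric(df["Custom Dimension 02"], errors="coerce")

# Report every row whose value failed to parse,
# ignoring rows that were empty to begin with
bad = df.index[coerced.isna() & df["Custom Dimension 02"].notna()]
print(bad.tolist())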

Upvotes: 3

paulo.filip3

Reputation: 3297

As others have pointed out, your data is probably malformed somewhere, e.g. stray quotes or other non-numeric characters in a few rows.

Just try doing:

import pandas as pd
import numpy as np

df = pd.read_csv("data.csv", dtype={"CallGuid": np.int64})

It's also more memory-efficient, since pandas doesn't have to infer the data types while parsing.
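One caveat (my note, not from the answer above): if some rows really are malformed, read_csv will raise a ValueError when forced to int64. A safe fallback is to load the column as strings and convert afterwards, so bad entries surface as NaN instead of aborting the read:

import pandas as pd

# Read the suspect column as plain strings so nothing fails on load
df = pd.read_csv("data.csv", dtype={"CallGuid": str})

# Convert afterwards; unparseable entries become NaN instead of raising
df["CallGuid"] = pd.to_numeric(df["CallGuid"], errors="coerce")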

Upvotes: 6
