elbillaf
elbillaf

Reputation: 1984

Pandas says every column is an object, even though I think it's an integer

I have a data frame that is somehow all objects - which I think should be okay. Notice that the first column has values like "10180.".

Problem solved: There was some kind of weird unicode thing going on. My task lead solved the problem. We just read it in as straight excel instead of converting to csv (I was using libreoffice to do that). Problem solved. A big hint was all these things that "should" work that were not working.

Those should all be "10180" - no decimal. (Note that in Jupyter it displays correctly. Only should up as a decimal when I output as csv. However Jupyter does know that it's an object.)

Another problem is potentially the data values that look like "2,361.9". Those should be floats. I thought I could do a similar thing with those to get rid of the commas and then convert.

Sample data:

CBSA Code,CBSA Title,violent,murder,rape,robbery,assault,property,burglary,larceny,vehicle theft
10180.0,"Abilene, TX",393.2,5.3,64.0,65.7,258.2,"2,361.9",534.0,"1,670.0",157.8
10420.0,"Akron, OH",361.6,6.4,48.7,73.0,233.6,"2,226.0",415.6,"1,659.4",150.9
10500.0,"Albany, GA",728.5,11.6,30.6,95.1,591.3,"3,734.5",773.4,"2,715.1",246.0
10580.0,"Albany-Schenectady-Troy, NY",283.7,2.2,38.3,62.4,180.8,"1,892.3",226.9,"1,584.8",80.6

That first column should be integer. I've tried

df[‘CBSA Code’].apply(np.int64)  AND

df[‘CBSA Code’].astype(int) AND

df[‘CBSA Code’].astype(str).astype(int) AND

df[‘CBSA Code’] = df[‘CBSA Code’].astype(str)
df[‘CBSA Code’] = df[‘CBSA Code’].replace(“.0”, ’’)
df[‘CBSA Code’] = df[‘CBSA Code’].astype(‘int’)

I've seen some of these posted as answers in other stackoverflow questions. But it's not working for me. This must be a common dilemma. Is there a canonical way of doing this?

The error msg with the df[‘CBSA Code’].apply(np.int64) follows

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-189-6c1c6381a02c> in <module>
----> 1 df['CBSA Code'].apply(np.int64)

~\AppData\Roaming\Python\Python37\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   3589             else:
   3590                 values = self.astype(object).values
-> 3591                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3592 
   3593         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

ValueError: invalid literal for int() with base 10: '10180.0'

Upvotes: 0

Views: 1709

Answers (2)

Andy L.
Andy L.

Reputation: 25239

I suspect CBSA Code has some non-numeric values, so read_csv defaults it to dtype object. You may try using nullable integer dtype Int64 (note: it is uppercase 'I')

df['CBSA Code'] = pd.to_numeric(df['CBSA Code'], errors='coerce').astype('Int64')

Upvotes: 1

Toby Petty
Toby Petty

Reputation: 4660

If the issue is that the CBSA Code column is a float formatted as a string (as it seems from the error message: ValueError: invalid literal for int() with base 10: '10180.0'), then you can't convert directly to int, but you can cast as float first and then as int:

df["CBSA Code"] = df["CBSA Code"].astype(float).astype(int)

Upvotes: 1

Related Questions