Reputation: 1984
I have a data frame that is somehow all objects - which I think should be okay. Notice that the first column has values like "10180.".
Problem solved: There was some kind of weird unicode thing going on. My task lead solved the problem. We just read it in as straight excel instead of converting to csv (I was using libreoffice to do that). Problem solved. A big hint was all these things that "should" work that were not working.
Those should all be "10180" - no decimal. (Note that in Jupyter it displays correctly. Only should up as a decimal when I output as csv. However Jupyter does know that it's an object.)
Another problem is potentially the data values that look like "2,361.9". Those should be floats. I thought I could do a similar thing with those to get rid of the commas and then convert.
Sample data:
CBSA Code,CBSA Title,violent,murder,rape,robbery,assault,property,burglary,larceny,vehicle theft
10180.0,"Abilene, TX",393.2,5.3,64.0,65.7,258.2,"2,361.9",534.0,"1,670.0",157.8
10420.0,"Akron, OH",361.6,6.4,48.7,73.0,233.6,"2,226.0",415.6,"1,659.4",150.9
10500.0,"Albany, GA",728.5,11.6,30.6,95.1,591.3,"3,734.5",773.4,"2,715.1",246.0
10580.0,"Albany-Schenectady-Troy, NY",283.7,2.2,38.3,62.4,180.8,"1,892.3",226.9,"1,584.8",80.6
That first column should be integer. I've tried
df[‘CBSA Code’].apply(np.int64) AND
df[‘CBSA Code’].astype(int) AND
df[‘CBSA Code’].astype(str).astype(int) AND
df[‘CBSA Code’] = df[‘CBSA Code’].astype(str)
df[‘CBSA Code’] = df[‘CBSA Code’].replace(“.0”, ’’)
df[‘CBSA Code’] = df[‘CBSA Code’].astype(‘int’)
I've seen some of these posted as answers in other stackoverflow questions. But it's not working for me. This must be a common dilemma. Is there a canonical way of doing this?
The error msg with the df[‘CBSA Code’].apply(np.int64) follows
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-189-6c1c6381a02c> in <module>
----> 1 df['CBSA Code'].apply(np.int64)
~\AppData\Roaming\Python\Python37\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
3589 else:
3590 values = self.astype(object).values
-> 3591 mapped = lib.map_infer(values, f, convert=convert_dtype)
3592
3593 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
ValueError: invalid literal for int() with base 10: '10180.0'
Upvotes: 0
Views: 1709
Reputation: 25239
I suspect CBSA Code
has some non-numeric values, so read_csv defaults it to dtype object
. You may try using nullable integer dtype Int64
(note: it is uppercase 'I'
)
df['CBSA Code'] = pd.to_numeric(df['CBSA Code'], errors='coerce').astype('Int64')
Upvotes: 1
Reputation: 4660
If the issue is that the CBSA Code
column is a float formatted as a string (as it seems from the error message: ValueError: invalid literal for int() with base 10: '10180.0'
), then you can't convert directly to int, but you can cast as float first and then as int:
df["CBSA Code"] = df["CBSA Code"].astype(float).astype(int)
Upvotes: 1