Reputation: 1891
I have a large dataframe with ID numbers:
ID.head()
Out[64]:
0 4806105017087
1 4806105017087
2 4806105017087
3 4901295030089
4 4901295030089
These are all strings at the moment.
I want to convert them to int without using loops. For this I use ID.astype(int).
The problem is that some of my rows contain dirty data which cannot be converted to int, e.g.
ID[154382]
Out[58]: 'CN414149'
How can I (without using loops) remove these kinds of occurrences so that I can use astype with peace of mind?
Upvotes: 67
Views: 269666
Reputation: 1
I solved this in January 2024, in the then-latest version of Jupyter Notebook, as follows.
Wrap the conversion in try/except so that, if it fails, you can see what the error is. I checked the dtype of the "Price" column: previously it was object ("O") and now it shows int64, which is what we are looking for.
try:
    # Strip "$" and ",", drop any decimal part, then cast to int.
    car_sales["Price"] = car_sales["Price"].str.replace(r'[$,]|\.\d*', '', regex=True).astype(int)
except ValueError as e:
    print(f"Error: {e}")
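To make the effect of that pattern concrete, here is a minimal sketch on made-up data. The sample prices are hypothetical, since the original car_sales frame is not shown, and 'int64' is requested explicitly so the result does not depend on the platform's default int size.

```python
import pandas as pd

# Hypothetical sample data; the real car_sales frame is not shown in the answer.
car_sales = pd.DataFrame({"Price": ["$4,000.00", "$5,000.00", "$22,000.50"]})

try:
    # Remove "$" and ",", and drop the decimal part (truncation, not rounding).
    car_sales["Price"] = (
        car_sales["Price"]
        .str.replace(r"[$,]|\.\d*", "", regex=True)
        .astype("int64")
    )
except ValueError as e:
    print(f"Error: {e}")

print(car_sales["Price"].tolist())
print(car_sales["Price"].dtype)
```

Note that a value containing anything other than digits after the replacement (for example letters) would still raise ValueError, which is what the except branch reports.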
Upvotes: 0
Reputation: 23459
To avoid "OverflowError: Python int too large to convert to C long", use .astype('int64') for 64-bit signed integers:
df['ID'] = df['ID'].astype('int64')
If you don't want to lose the values with letters in them, use str.replace() with a regex pattern to remove the non-digit characters.
df['ID'] = df['ID'].str.replace('[^0-9]', '', regex=True).astype('int64')
Then the input
0 4806105017087
1 4806105017087
2 CN414149
Name: ID, dtype: object
converts into
0 4806105017087
1 4806105017087
2 414149
Name: ID, dtype: int64
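One caveat with stripping: a value with no digits at all becomes an empty string, which astype('int64') cannot parse, and a stripped ID such as 414149 may no longer be a genuine ID. A possible sketch that records which rows were altered and skips the empty ones follows; the sample values and the names altered, empty and cleaned are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

# Hypothetical sample, including a value ("N/A") that contains no digits.
ids = pd.Series(["4806105017087", "CN414149", "N/A"], name="ID")

digits_only = ids.str.replace(r"[^0-9]", "", regex=True)

# Rows whose text was changed by the stripping step.
altered = digits_only != ids
# Rows left with nothing to convert.
empty = digits_only == ""

# Convert only the rows that still contain digits.
cleaned = digits_only[~empty].astype("int64")

print(altered.tolist())
print(cleaned.tolist())
```

Keeping the altered mask around makes it possible to review the modified rows later instead of silently trusting them.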
Upvotes: 10
Reputation: 863731
You need to pass the parameter errors='coerce' to the function to_numeric:
ID = pd.to_numeric(ID, errors='coerce')
If ID is a column:
df.ID = pd.to_numeric(df.ID, errors='coerce')
However, non-numeric values are converted to NaN, so all values become float.
To get int, you need to replace NaN with some value, e.g. 0, and then cast to int:
df.ID = pd.to_numeric(df.ID, errors='coerce').fillna(0).astype(np.int64)
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':['4806105017087','4806105017087','CN414149']})
print (df)
ID
0 4806105017087
1 4806105017087
2 CN414149
print (pd.to_numeric(df.ID, errors='coerce'))
0 4.806105e+12
1 4.806105e+12
2 NaN
Name: ID, dtype: float64
df.ID = pd.to_numeric(df.ID, errors='coerce').fillna(0).astype(np.int64)
print (df)
ID
0 4806105017087
1 4806105017087
2 0
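Since the question asks to remove the dirty rows rather than replace them with 0, a variant of the same idea is to use the coerced result as a filter. This is only a sketch: the names numeric_id, ok and clean are illustrative, and it assumes the IDs are below 2**53, because the coerced column passes through float64 and larger integers would lose precision.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["4806105017087", "4806105017087", "CN414149"]})

# Unparseable values become NaN, so the intermediate result is float64.
numeric_id = pd.to_numeric(df["ID"], errors="coerce")
ok = numeric_id.notna()

# Keep only the rows that parsed, then cast the remaining values to int64.
clean = df.loc[ok].copy()
clean["ID"] = numeric_id[ok].astype("int64")

print(clean)
```

The row containing 'CN414149' is dropped, and the remaining column has dtype int64.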
EDIT: If you use pandas 0.25+, it is possible to use the nullable integer dtype (integer_na):
df.ID = pd.to_numeric(df.ID, errors='coerce').astype('Int64')
print (df)
ID
0 4806105017087
1 4806105017087
2 NaN
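As a quick check of what the nullable dtype gives you, the sketch below counts the values that failed to parse. The variable name n_bad is illustrative, and recent pandas versions print the missing entry as <NA> rather than NaN.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["4806105017087", "4806105017087", "CN414149"]})

# The capital "I" selects the nullable integer dtype, which can hold missing values.
df["ID"] = pd.to_numeric(df["ID"], errors="coerce").astype("Int64")

# Number of values that could not be converted.
n_bad = int(df["ID"].isna().sum())

print(df["ID"].dtype)
print(n_bad)
```

The valid IDs stay integers instead of being shown in float notation, and the dirty row remains identifiable through isna().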
Upvotes: 120