Reputation: 14016
I have been given a survey to analyze. Unfortunately, some participants entered some values using Arabic/Farsi digits. For example:
import pandas as pd
pd.DataFrame(["24", "۱۲", "45", "۳۲"], columns=["age"])
What I want is to convert all the values to Python integers:
[24, 12, 45, 32]
What is the most canonical/performant way to do this?
Upvotes: 2
Views: 422
Reputation: 9096
You can apply Python's built-in int function, which understands Unicode decimal digits, including the Arabic/Farsi ones:
>>> from pandas import DataFrame
>>> df = DataFrame(["24", "۱۲", "45", "۳۲"], columns=["age"])
>>> df['age'] = df['age'].apply(int)
>>> df['age']
0 24
1 12
2 45
3 32
Name: age, dtype: int64
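Survey columns often contain entries that are not numbers at all, in which case a bare int raises ValueError. The following is a minimal sketch of a tolerant wrapper, using only the standard library; parse_survey_int is a hypothetical helper name and the "n/a" entry is made up for illustration, neither comes from the question.

```python
import unicodedata

def parse_survey_int(text):
    """Hypothetical helper: return int(text), or None if it is not a valid integer."""
    try:
        # int() accepts any Unicode decimal digit, not only ASCII 0-9
        return int(text)
    except ValueError:
        return None

# "n/a" is an invented invalid entry, added only to show the fallback
raw = ["24", "۱۲", "45", "۳۲", "n/a"]
parsed = [parse_survey_int(value) for value in raw]
print(parsed)  # [24, 12, 45, 32, None]

# The Farsi digit is a decimal digit (category "Nd") with numeric value 1
print(unicodedata.category("۱"), unicodedata.decimal("۱"))  # Nd 1
```

The same function could be passed to df['age'].apply in place of int, at the cost of getting an object/float column when None values appear.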
Actually, numpy/pandas dtypes are also aware of Unicode numerals, so the usual type casts work as well:
>>> import pandas as pd
>>> pd.Series(["24", "۱۲", "45", "۳۲"], dtype='float64')
0 24.0
1 12.0
2 45.0
3 32.0
dtype: float64
>>> pd.Series(["24", "۱۲", "45", "۳۲"]).astype('int64')
0 24
1 12
2 45
3 32
dtype: int64
>>> import numpy as np
>>> np.array(["24", "۱۲", "45", "۳۲"], dtype='int64')
array([24, 12, 45, 32], dtype=int64)
>>> np.array(["24", "۱۲", "45", "۳۲"]).astype('float16')
array([24., 12., 45., 32.], dtype=float16)
Upvotes: 2
Reputation: 59274
Apply unidecode to the values first, and then convert them using pd.to_numeric:
pip install unidecode
from unidecode import unidecode
df['numbers'] = pd.to_numeric(df.age.apply(unidecode), errors='coerce')
  age  numbers
0  24       24
1  ۱۲       12
2  45       45
3  ۳۲       32
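If adding a dependency is not an option, a similar normalization can be sketched with the standard library alone. This is a different technique from unidecode (it only rewrites decimal digits and leaves all other characters untouched); DIGIT_TABLE and normalize_digits are hypothetical names chosen for this example.

```python
import sys
import unicodedata

# Map every Unicode decimal digit (category "Nd") to its ASCII counterpart.
# Keys are code points, values are one-character strings, as str.translate expects.
DIGIT_TABLE = {
    cp: str(unicodedata.decimal(chr(cp)))
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Nd"
}

def normalize_digits(text):
    """Hypothetical helper: replace any Unicode decimal digit with ASCII 0-9."""
    return text.translate(DIGIT_TABLE)

ages = ["24", "۱۲", "45", "۳۲"]
normalized = [normalize_digits(age) for age in ages]
print(normalized)  # ['24', '12', '45', '32']
```

With pandas, the same table could presumably be used as pd.to_numeric(df['age'].str.translate(DIGIT_TABLE), errors='coerce'), which keeps the errors='coerce' behaviour of the answer above.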
Upvotes: 3