Reputation: 1309
I have a DataFrame df that looks like this:
   birth_year  person
0        1980       0
1        1981       1
2        1982       2
3        1983       3
4        1984       4
The birth_year column looks like numbers, but when I check the data type with df['birth_year'].dtype the result is dtype('O'). So I thought it might actually be a string and tried to convert it to numbers with df['birth_year'].astype('int'), but got an error:
UnicodeEncodeError: 'decimal' codec can't encode characters in position 0-3: invalid decimal Unicode string
After a little googling I came to understand (possibly wrongly) that there seem to be some invisible characters in it. When I access a value with df['birth_year'][0], I get 1980L rather than 1980.
So what exactly is the data type, and how can I convert it to integers? I read somewhere that dtype('O') usually means the column holds strings, but that doesn't seem to be the case here.
Upvotes: 3
Views: 5484
Reputation: 394101
You can normally convert using df['birth_year'].astype(int), but it seems you have invalid values. Using df = df.convert_objects(convert_numeric=True) will coerce invalid values to NaN, which may or may not be what you want, as this changes the dtype to float64 rather than int64.
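Note that convert_objects was removed in later pandas releases; pd.to_numeric with errors="coerce" does the same coercion. A minimal sketch of that, assuming a reasonably recent pandas and using made-up sample data (the bad value "19x2" is hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical sample data: one value is not a clean number.
df = pd.DataFrame({
    "birth_year": ["1980", "1981", "19x2", "1983"],
    "person": [0, 1, 2, 3],
})

# errors="coerce" replaces unparseable values with NaN,
# which forces the resulting dtype to float64.
years = pd.to_numeric(df["birth_year"], errors="coerce")

# Optional: the nullable "Int64" dtype keeps integers and
# represents the missing entries as <NA> instead of NaN.
years_int = years.astype("Int64")
```

If every value converts cleanly there are no NaN entries, and a plain astype(int) on the result gives int64.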
It's best to look at the invalid string values to determine why they failed to convert. So you could do df[df['birth_year'].convert_objects(convert_numeric=True).isnull()] to get the rows that have invalid 'birth_year' values.
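On a pandas version without convert_objects, the same lookup could be sketched with pd.to_numeric. The sample values below are hypothetical stand-ins for whatever is actually wrong in your column:

```python
import pandas as pd

# Hypothetical sample data with two values that will not parse.
df = pd.DataFrame({
    "birth_year": ["1980", "19x2", "unknown", "1983"],
    "person": [0, 1, 2, 3],
})

# Values that become NaN after coercion but were not missing
# to begin with are the ones that failed to convert.
converted = pd.to_numeric(df["birth_year"], errors="coerce")
bad_mask = converted.isna() & df["birth_year"].notna()
bad_rows = df[bad_mask]

# repr() makes stray or invisible characters visible.
print(bad_rows["birth_year"].map(repr))
```

Printing the repr of each offending value is usually the quickest way to see whether the problem is stray whitespace, a non-ASCII character, or a genuinely non-numeric entry.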
Upvotes: 2