Olivier Ma

Reputation: 1309

Cannot change the data type in the data frame

I have a data frame df that looks like this:

        birth_year  person
    0       1980         0
    1       1981         1
    2       1982         2
    3       1983         3
    4       1984         4

The birth_year column looks like it contains numbers, but when I check the data type with df['birth_year'].dtype, the result is dtype('O').

So I thought it might actually be a string column, and tried to convert it to numbers with df['birth_year'].astype('int'), but got an error:

    UnicodeEncodeError: 'decimal' codec can't encode characters in position 
    0-3: invalid decimal Unicode string

After a little googling I came to understand (possibly wrongly) that there may be some invisible characters in it. When I access a value with df['birth_year'][0], I get 1980L rather than 1980.

So what exactly is the data type, and how can I convert it to integers? I read somewhere that a returned data type of dtype('O') usually means the column holds strings, but this doesn't seem to be the case here.

Upvotes: 3

Views: 5484

Answers (1)

EdChum

Reputation: 394101

You can normally convert using df['birth_year'].astype(int), but it seems the column contains invalid values. Using df = df.convert_objects(convert_numeric=True) will coerce invalid values to NaN, which may or may not be what you want, because the presence of NaN changes the dtype to float64 rather than int64.
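As a minimal sketch of that coercion: convert_objects was later deprecated and removed from pandas, so on recent versions pd.to_numeric with errors='coerce' plays the same role. The sample data below, including the malformed value '19x2', is made up for illustration and is not taken from your frame.

```python
import pandas as pd

# Hypothetical data: one birth_year value is not a clean number.
df = pd.DataFrame({
    "birth_year": ["1980", "1981", "19x2", "1983"],
    "person": [0, 1, 2, 3],
})

# errors="coerce" replaces anything that cannot be parsed with NaN
# instead of raising an exception.
converted = pd.to_numeric(df["birth_year"], errors="coerce")

# Because NaN is a float, the resulting column is float64, not int64.
print(converted.dtype)
print(converted)
```

If every value parses cleanly, the same call returns an integer column, so a float64 result is itself a hint that something failed to convert.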

It's best to look at the invalid string values to determine why they failed to convert.

So you could do df[df['birth_year'].convert_objects(convert_numeric=True).isnull()] to get the rows that have invalid 'birth_year' values.
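The same row lookup can be sketched on a recent pandas, again assuming pd.to_numeric stands in for the removed convert_objects. The frame and its bad value are hypothetical; printing the repr of the offending entries is one way to expose hidden or non-ASCII characters.

```python
import pandas as pd

# Hypothetical data: row 2 holds a value that is not a valid number.
df = pd.DataFrame({
    "birth_year": ["1980", "1981", "19x2", "1983"],
    "person": [0, 1, 2, 3],
})

# Boolean mask: True where the value could not be parsed as a number.
mask = pd.to_numeric(df["birth_year"], errors="coerce").isnull()

# Rows whose birth_year failed to convert.
bad_rows = df[mask]
print(bad_rows)

# repr() shows the raw contents, including any invisible characters.
for value in bad_rows["birth_year"]:
    print(repr(value))

# Once the bad rows are understood (here simply dropped), the rest
# converts to a true integer column.
clean = df[~mask].copy()
clean["birth_year"] = clean["birth_year"].astype(int)
print(clean["birth_year"].dtype)
```

Dropping the rows is only one option; if the bad values turn out to be fixable (stray whitespace, for instance), cleaning them before conversion keeps all the data.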

Upvotes: 2
