Why is the pandas dataframe converting integer to float datatype

Question

I have a CSV file like this:

Wed Dec 04 11:30:04 GMT+05:30 2019,20,35.0,143455434,0
Wed Dec 04 11:30:13 GMT+05:30 2019,40,25.5,null,

I would like to load this into pandas and convert the individual columns to their respective data types. This is how I do it:

import numpy as np
import pandas as pd

raw_df = pd.read_csv('raw.csv', dtype=str)
# pd.np was removed in pandas 2.0; use numpy's nan directly
raw_df = raw_df.replace({'null': None, np.nan: None})

This is my conversion function:

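# parse is assumed to be dateutil's parser (the tzoffset in the output below suggests it)
from dateutil.parser import parse

def df_function(row):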
    row['timestamp'] = parse(row['timestamp'])
    row['odometer'] = float(row['odometer']) + 1
    row['speed'] = float(row['speed'])

    if row['id'] is not None:
        row['id'] = str(row['id'])

    if row['error_code'] is not None:
        row['error_code'] = int(row['error_code'])

    return row

raw_df = raw_df.apply(df_function, axis=1)

Printing the data types of the columns gives:

timestamp     datetime64[ns, tzoffset(None, -19800)]
odometer                                     float64
speed                                        float64
id                                            object
error_code                                   float64
dtype: object

error_code is float64, though it should be int64. What is the issue here?

Sociopath · Accepted Answer

As mentioned in the pandas documentation:

The Integer NA support currently uses the capitalized dtype version, e.g. Int8 as compared to the traditional int8. This may be changed at a future date

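NumPy's int64 cannot represent a missing value, so pandas stores a column that mixes integers and missing values as float64. You need to convert the column to a nullable integer dtype such as Int8 (note that Int8 only holds values from -128 to 127; use Int64 for larger values):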

import numpy as np
import pandas as pd

df = pd.DataFrame({"error_code": [1, 2, 5, np.nan]})
print(df.dtypes)

# error_code    float64
# dtype: object

df["error_code"] = df["error_code"].astype("Int8") 
print(df.dtypes)

Output:

error_code    Int8
dtype: object
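
For the CSV in the question, one possible approach is to let read_csv assign the nullable dtype while loading, instead of converting row by row with apply. This is only a minimal sketch under a few assumptions: the file has no header row, the column names are the ones used in the question's function, and Int64 is used rather than Int8 in case the error codes fall outside the Int8 range. The timestamp column is left as text here.

```python
import io

import pandas as pd

# The two sample rows from the question. Reading from a string is only for
# illustration; with a real file, pass its path to read_csv instead.
csv_text = (
    "Wed Dec 04 11:30:04 GMT+05:30 2019,20,35.0,143455434,0\n"
    "Wed Dec 04 11:30:13 GMT+05:30 2019,40,25.5,null,\n"
)

# Assumed column names, taken from the question's conversion function.
columns = ["timestamp", "odometer", "speed", "id", "error_code"]

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,            # assumption: the file has no header row
    names=columns,
    na_values=["null"],     # treat the literal text "null" as missing
    dtype={
        "odometer": "float64",
        "speed": "float64",
        "id": str,
        "error_code": "Int64",  # nullable integer: missing values stay <NA>
    },
)

print(df.dtypes)
print(df["error_code"])
```

With this, error_code keeps an integer dtype and the empty cell shows up as <NA> rather than forcing the whole column to float64.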
