Reputation: 4319
I have a CSV file like this:
Wed Dec 04 11:30:04 GMT+05:30 2019,20,35.0,143455434,0
Wed Dec 04 11:30:13 GMT+05:30 2019,40,25.5,null,
I would like to load this into pandas and convert each column to its appropriate data type. This is how I load it:
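import numpy as np
import pandas as pd
raw_df = pd.read_csv('raw.csv', dtype=str)
raw_df = raw_df.replace({'null': None, np.nan: None})  # np.nan instead of pd.np.nan (pd.np was removed in pandas 2.0)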
This is my conversion function:
from dateutil.parser import parse

def df_function(row):
    row['timestamp'] = parse(row['timestamp'])
    row['odometer'] = float(row['odometer']) + 1
    row['speed'] = float(row['speed'])
    if row['id'] is not None:
        row['id'] = str(row['id'])
    if row['error_code'] is not None:
        row['error_code'] = int(row['error_code'])
    return row
raw_df = raw_df.apply(df_function, axis=1)
When you print the data types of the columns you will find
timestamp datetime64[ns, tzoffset(None, -19800)]
odometer float64
speed float64
id object
error_code float64
dtype: object
error_code is float64, though I expected int64. What is the issue here?
Upvotes: 1
Views: 420
Reputation: 13426
As mentioned in the pandas documentation:
The Integer NA support currently uses the capitalized dtype version, e.g. Int8 as compared to the traditional int8. This may be changed at a future date
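A plain int64 column cannot hold missing values, so as soon as error_code contains a None/NaN pandas upcasts the whole column to float64. You need to convert the column to a nullable integer dtype such as Int8 (note the capital I). Int8 only holds values from -128 to 127, so Int64 is the safer choice if your error codes can be larger.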
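import numpy as np
import pandas as pd

# A missing value forces the column to float64
df = pd.DataFrame({"error_code": [1, 2, 5, np.nan]})
print(df.dtypes)
# error_code float64
# dtype: object

# Cast to the nullable integer dtype
df["error_code"] = df["error_code"].astype("Int8")
print(df.dtypes)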
Output:
error_code Int8
dtype: object
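Applied to the data in the question, the same idea can be sketched without a row-wise apply. This is only a minimal sketch: the column names come from the question's conversion function, their order in the file is an assumption, and the timestamp column is left as a string here. Int64 is used instead of Int8 so that larger codes would also fit.

```python
import io

import pandas as pd

# Two sample rows copied from the question. The column order is assumed.
csv_text = (
    "Wed Dec 04 11:30:04 GMT+05:30 2019,20,35.0,143455434,0\n"
    "Wed Dec 04 11:30:13 GMT+05:30 2019,40,25.5,null,\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["timestamp", "odometer", "speed", "id", "error_code"],
    dtype=str,            # read everything as text first, as in the question
    na_values=["null"],   # treat the literal string "null" as missing
)

# Column-wise conversions; missing values stay missing.
df["odometer"] = df["odometer"].astype(float)
df["speed"] = df["speed"].astype(float)

# Parse the text to numbers, then cast to the nullable integer dtype.
df["error_code"] = pd.to_numeric(df["error_code"]).astype("Int64")

print(df.dtypes)
```

With this approach the missing error_code shows up as <NA> rather than NaN, and the column keeps an integer dtype.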
Upvotes: 1