Reputation: 1
I'm working with a dataset containing details about used cars, and I've encountered several missing values in the Fuel_Type column. The possible values include 'Gasoline', 'E85 Flex Fuel', 'Hybrid', 'Diesel', and others. Currently, my data has over 4,000 electric vehicles, fewer than 50 gasoline vehicles, and some hybrids with missing Fuel_Type entries. Additionally, some entries contain non-standard values like '–' and 'not supported'. Accurately filling these missing values is crucial for my analysis, as they significantly impact the results.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
# Sample DataFrame
data = {
'Car': ['Toyota', 'Honda', 'Tesla', None, 'Ford'],
'Fuel_Type': ['Gasoline', 'E85 Flex Fuel', np.nan, 'Hybrid', None],
'Transmission': ['Automatic', None, 'Automatic', 'Manual', 'Manual']
}
df = pd.DataFrame(data)
# Initial imputation attempt
imputer = SimpleImputer(strategy='most_frequent')
df['Fuel_Type'] = imputer.fit_transform(df[['Fuel_Type']])
print(df)
Upvotes: -1
Views: 75
Reputation: 262214
You could fillna with empty strings ('') and define those as the missing values. Also slice the output to make it 1D:
imputer = SimpleImputer(strategy='most_frequent', missing_values='')
df['Fuel_Type'] = imputer.fit_transform(df[['Fuel_Type']].fillna(''))[:, 0]
Output:
      Car      Fuel_Type  Transmission
0  Toyota       Gasoline     Automatic
1   Honda  E85 Flex Fuel          None
2   Tesla  E85 Flex Fuel     Automatic
3    None         Hybrid        Manual
4    Ford  E85 Flex Fuel        Manual
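In this tiny sample every observed fuel type appears exactly once, so "most frequent" is a tie. As far as I know, SimpleImputer breaks such ties by taking the smallest value, which would explain why 'E85 Flex Fuel' is used here. A minimal sketch to check which fill value was learned, reusing the sample column from the question (the variable names `fuel` and `fill_value` are only for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same Fuel_Type values as the sample DataFrame in the question
fuel = pd.DataFrame({'Fuel_Type': ['Gasoline', 'E85 Flex Fuel', np.nan, 'Hybrid', None]})

# Treat the empty string as the missing marker, as in the snippet above
imputer = SimpleImputer(strategy='most_frequent', missing_values='')
imputer.fit(fuel.fillna(''))

# statistics_ holds the fill value learned for each column
fill_value = imputer.statistics_[0]
print(fill_value)
```

On real data with a clear majority class, the fill value is simply that class, so every gap gets the same label.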
If you want to handle all columns:
imputer = SimpleImputer(strategy='most_frequent', missing_values='')
df[:] = imputer.fit_transform(df.fillna(''))
Output:
      Car      Fuel_Type  Transmission
0  Toyota       Gasoline     Automatic
1   Honda  E85 Flex Fuel     Automatic
2   Tesla  E85 Flex Fuel     Automatic
3    Ford         Hybrid        Manual
4    Ford  E85 Flex Fuel        Manual
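The question also mentions placeholder entries such as '–' and 'not supported'. These are ordinary strings, so the imputer would count them as valid categories unless they are mapped to the missing marker first. A rough sketch of one way to do that, assuming those two strings are the only placeholders (the sample rows and the `placeholders` list are made up for illustration and should be adapted to the real data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sample containing real categories, placeholders, and NaN/None
df = pd.DataFrame({
    'Fuel_Type': ['Gasoline', 'Gasoline', '–', 'not supported', None, 'Hybrid', np.nan],
})

# Assumed list of strings that should count as "missing"
placeholders = ['–', 'not supported']

# Blank out the placeholders, then map every missing entry to ''
col = df['Fuel_Type']
cleaned = col.mask(col.isin(placeholders)).fillna('')

# Impute '' with the most frequent remaining category
imputer = SimpleImputer(strategy='most_frequent', missing_values='')
df['Fuel_Type'] = imputer.fit_transform(cleaned.to_frame())[:, 0]
print(df)
```

Here 'Gasoline' is the most common remaining value, so it fills every gap. Keep in mind that with a heavily imbalanced column, this strategy assigns the dominant class to all missing rows.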
Upvotes: 0