Reputation: 65
I have a data file that I load and process with a pandas DataFrame. My code works, but I'm wondering if there is a more efficient way to achieve what I'm trying to do. My code is as follows:
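import pandas as pd

df = pd.read_csv("file_name.data", sep=r"\s+", names=["A","B","Horsepower"])
df1 = df[df.Horsepower != '?']  # keep only rows with a known Horsepower
df2 = df1["Horsepower"].apply(pd.to_numeric)  # convert the remaining strings to numbers
df = df.replace('?', df2.mean())  # replace returns a new DataFrame, so assign the result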
In the data itself, the Horsepower column comes with several missing values that have been replaced with '?'. The above code replaces these '?' values with the mean of the Horsepower column, excluding the '?' values.
With that established, is there a more efficient way to replace the '?' values in "Horsepower" with the mean of the "Horsepower" column?
Upvotes: 2
Views: 2287
Reputation: 59529
This will work. Passing errors='coerce' to pd.to_numeric turns anything that can't be converted into a number into NaN, and those NaN values are skipped when averaging.
df['Horsepower'] = df['Horsepower'].replace('?',
np.mean(pd.to_numeric(df['Horsepower'], errors='coerce')))
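One thing to be aware of: as far as I can tell, the replace call above leaves the column with object dtype, since the non-missing entries are still strings. If you want a numeric column at the end, here is a minimal sketch of two alternatives. The inline sample data and the names raw, df_str and hp are made up for illustration and are not from your file.

```python
import io

import pandas as pd

# Hypothetical sample standing in for file_name.data
# (whitespace-separated, '?' marks a missing Horsepower value).
raw = "1 2 130.0\n3 4 ?\n5 6 150.0\n"
cols = ["A", "B", "Horsepower"]

# Option 1: tell read_csv that '?' means missing, so the column is parsed as float.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", names=cols, na_values="?")
df["Horsepower"] = df["Horsepower"].fillna(df["Horsepower"].mean())

# Option 2: the frame is already loaded with '?' strings in the column.
df_str = pd.read_csv(io.StringIO(raw), sep=r"\s+", names=cols)
hp = pd.to_numeric(df_str["Horsepower"], errors="coerce")  # '?' becomes NaN
df_str["Horsepower"] = hp.fillna(hp.mean())  # mean() skips NaN by default
```

Both options end with a float64 Horsepower column. Option 1 is probably the simpler choice if '?' is the only missing-value marker in the file.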
Upvotes: 2