Reputation: 127
Is it possible to impute values for a specific column?
For example, if I have 3 columns:
Upvotes: 5
Views: 19039
Reputation: 5273
If you have a dataframe with missing data in multiple columns and you want to impute one specific column based on the others, you can impute everything and then keep only the column you want:
from sklearn.impute import KNNImputer
import pandas as pd
imputer = KNNImputer()
imputed_data = imputer.fit_transform(df) # impute all the missing data
df_temp = pd.DataFrame(imputed_data, columns=df.columns, index=df.index) # keep the original labels so rows align on assignment
df['COL_TO_IMPUTE'] = df_temp['COL_TO_IMPUTE'] # update only the desired column
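To make the "impute everything, keep one column" idea concrete, here is a minimal self-contained sketch. The toy dataframe, its column names, and `n_neighbors=2` are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: every column has one missing value.
df = pd.DataFrame(
    {
        "A": [1.0, 2.0, np.nan, 4.0],
        "B": [10.0, np.nan, 30.0, 40.0],
        "C": [np.nan, 200.0, 300.0, 400.0],
    },
    index=["w", "x", "y", "z"],
)

# Impute the whole frame, keeping the original row and column labels.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

# Copy back only the column we actually want imputed.
result = df.copy()
result["C"] = imputed["C"]
```

After this, `result["C"]` has no missing values, while `A` and `B` still contain their original NaNs.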
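Another method would be to replace all the missing data in the desired column with a sentinel value that does not occur anywhere else in the data, and then tell the imputer that this sentinel is your missing-value marker. KNNImputer only accepts numeric input, so a string such as '#' will not work; use a number instead, for example max + 1. Note that when missing_values is not NaN, the imputer rejects any NaN left in the input, so this only works if the other columns have no missing values:
from sklearn.impute import KNNImputer
import pandas as pd
sentinel = float(df.max().max()) + 1 # numeric value that does not occur in the data
df['COL_TO_IMPUTE'] = df['COL_TO_IMPUTE'].fillna(sentinel) # replace all missing data in the desired column with the sentinel
imputer = KNNImputer(missing_values=sentinel) # tell the imputer to consider only the sentinel as missing data
imputed_data = imputer.fit_transform(df) # impute every sentinel value
df = pd.DataFrame(data=imputed_data, columns=df.columns, index=df.index)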
Upvotes: 4
Reputation: 11
As you said, some of the columns have no missing data. That means any imputation method (mean, KNN, or another) will only impute the missing values in column C. You just have to pass your data with the missing values to the imputer, and you will get back the full data with nothing missing.
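import numpy as np
from sklearn.impute import SimpleImputer
imr = SimpleImputer(missing_values=np.nan, strategy='mean') # np.NaN was removed in NumPy 2.0
imr = imr.fit(with_missing)
imputed_data = imr.transform(with_missing)
Or with the KNN imputer:
from sklearn.impute import KNNImputer
imputer_KNN = KNNImputer(missing_values=np.nan, n_neighbors=3, weights="uniform", metric="nan_euclidean") # the marker must be np.nan, not the string "NaN"
imputed_data = imputer_KNN.fit_transform(with_missing)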
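As a quick check of the claim that complete columns pass through unchanged, here is a small sketch. The contents of `with_missing` are invented for illustration; only column C has gaps.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: A and B are complete, C has two missing values.
with_missing = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 4.0],
        "B": [5.0, 6.0, 7.0, 8.0],
        "C": [10.0, np.nan, 30.0, np.nan],
    }
)

imr = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = pd.DataFrame(
    imr.fit_transform(with_missing),
    columns=with_missing.columns,
    index=with_missing.index,
)

# A and B are untouched; the gaps in C are filled with the mean of 10 and 30.
unchanged = imputed[["A", "B"]].equals(with_missing[["A", "B"]])
```

Here `unchanged` is `True` and the missing entries of `C` become 20.0.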
Upvotes: -1
Reputation: 617
You can use numpy.ravel:
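from sklearn.impute import SimpleImputer # replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
imp = SimpleImputer(missing_values=0, strategy="mean") # treats 0 as the missing marker; use np.nan if your gaps are NaN. SimpleImputer always works column-wise, so axis is gone
df["C"] = imp.fit_transform(df[["C"]]).ravel()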
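For reference, a minimal sketch of the single-column approach on made-up data, assuming the gaps are NaN rather than 0. The dataframe and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: both B and C have a missing value.
df = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0],
        "B": [np.nan, 5.0, 6.0],
        "C": [1.0, np.nan, 3.0],
    }
)

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# The imputer expects 2-D input, hence df[["C"]] rather than df["C"].
filled = imp.fit_transform(df[["C"]])  # array of shape (3, 1)

# ravel() flattens the (3, 1) array so it can be assigned as a column.
df["C"] = filled.ravel()
```

Only `C` is passed to the imputer, so `B` keeps its NaN, and the fill value is computed from `C` alone rather than from the other columns.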
Upvotes: 12