Reputation: 1129
I have been trying to find Python code that would allow me to replace missing values in a dataframe's column. The focus of my analysis is in biostatistics so I am not comfortable with replacing values using means/medians/modes. I would like to apply the "Hot Deck Imputation" method.
I cannot find any Python functions or packages online that take the column of a dataframe and fill missing values with the "Hot Deck Imputation" method.
I did, however, see this GitHub project and did not find it useful.
The following is an example of some of my data (assume this is a pandas dataframe):
| age | sex | bmi | anesthesia score | pain level |
|-----|-----|------|------------------|------------|
| 78 | 1 | 40.7 | 3 | 0 |
| 55 | 1 | 25.3 | 3 | 0 |
| 52 | 0 | 25.4 | 3 | 0 |
| 77 | 1 | 44.9 | 3 | 3 |
| 71 | 1 | 26.3 | 3 | 0 |
| 39 | 0 | 28.2 | 2 | 0 |
| 82 | 1 | 27 | 2 | 1 |
| 70 | 1 | 37.9 | 3 | 0 |
| 71 | 1 | NA | 3 | 1 |
| 53 | 0 | 24.5 | 2 | NA |
| 68 | 0 | 34.7 | 3 | 0 |
| 57 | 0 | 30.7 | 2 | 0 |
| 40 | 1 | 22.4 | 2 | 0 |
| 73 | 1 | 34.2 | 2 | 0 |
| 66 | 1 | NA | 3 | 1 |
| 55 | 1 | 42.6 | NA | NA |
| 53 | 0 | 37.5 | 3 | 3 |
| 65 | 0 | 31.6 | 2 | 2 |
| 36 | 0 | 29.6 | 1 | 0 |
| 60 | 0 | 25.7 | 2 | NA |
| 70 | 1 | 30 | NA | NA |
| 66 | 1 | 28.3 | 2 | 0 |
| 63 | 1 | 29.4 | 3 | 2 |
| 70 | 1 | 36 | 3 | 2 |
I would like to apply a Python function that would allow me to input a column as a parameter and return the column with the missing values replaced with imputed values using the "Hot Deck Imputation" method.
I am using this for the purpose of statistical modeling with models such as linear and logistic regression using Statsmodels.api. I am not using this for Machine Learning.
Any help would be much appreciated!
Upvotes: 4
Views: 7866
Reputation: 15568
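You can use ffill, which carries the last observation forward (LOCF), a simple form of Hot Deck Imputation. Note that the result depends on row order, and a missing value in the first row stays unfilled.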
# ...
# fillna(method='ffill') is deprecated since pandas 2.1; DataFrame.ffill is the equivalent call
df.ffill(inplace=True)
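Since LOCF only borrows from the row directly above, a random hot deck within classes may be closer to what the question asks for: each missing value is filled with an observed value drawn at random from donor rows in the same class. The following is a minimal sketch, not a vetted implementation. The function name `hot_deck_impute`, its `by` and `seed` parameters, and the choice of `sex` as the class variable are assumptions made for illustration; the data are the first ten rows of the question's table.

```python
import numpy as np
import pandas as pd


def hot_deck_impute(df, column, by, seed=0):
    """Return `column` with NaNs replaced by values drawn at random from
    observed rows (donors) that share the same values of the `by` columns.
    Hypothetical helper, written for illustration only."""
    rng = np.random.default_rng(seed)

    def fill_group(s):
        donors = s.dropna().to_numpy()
        missing = s.isna()
        if donors.size == 0 or not missing.any():
            return s  # no donors in this class, or nothing to fill
        s = s.copy()
        # Draw one donor value (with replacement) per missing entry
        s[missing] = rng.choice(donors, size=int(missing.sum()), replace=True)
        return s

    return df.groupby(by)[column].transform(fill_group)


# First ten rows of the question's table (age, sex, bmi only)
df = pd.DataFrame({
    "age": [78, 55, 52, 77, 71, 39, 82, 70, 71, 53],
    "sex": [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    "bmi": [40.7, 25.3, 25.4, 44.9, 26.3, 28.2, 27.0, 37.9, np.nan, 24.5],
})
df["bmi_imputed"] = hot_deck_impute(df, "bmi", by="sex")
print(df)
```

Classes with no observed donor are left as NaN in this sketch, so the class variables should be coarse enough that every class contains at least one donor.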
Scikit-learn's impute module offers KNN, mean, median, most-frequent and other imputation methods. (https://scikit-learn.org/stable/modules/impute.html)
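# sklearn >= 0.22
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2, weights="uniform")
# Include covariates as well: with bmi alone there are no distances to compute,
# and KNNImputer falls back to the column mean.
cols = ['age', 'sex', 'bmi']
df['imputed_x'] = imputer.fit_transform(df[cols])[:, cols.index('bmi')]
print(df['imputed_x'])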
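With `n_neighbors=1`, KNNImputer copies the value of the single closest donor instead of averaging several, which behaves like a nearest-neighbour hot deck. A sketch under assumptions: the column name `bmi_nn` is made up for illustration, the data are the first ten rows of the question's table, and the covariates are left unscaled, so `age` dominates the distance.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# First ten rows of the question's table (age, sex, bmi only)
df = pd.DataFrame({
    "age": [78, 55, 52, 77, 71, 39, 82, 70, 71, 53],
    "sex": [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    "bmi": [40.7, 25.3, 25.4, 44.9, 26.3, 28.2, 27.0, 37.9, np.nan, 24.5],
})

cols = ["age", "sex", "bmi"]
# One neighbour: the imputed value is the donor's own observed value
imputer = KNNImputer(n_neighbors=1)
imputed = imputer.fit_transform(df[cols])

# Keep only the imputed bmi column (hypothetical column name)
df["bmi_nn"] = imputed[:, cols.index("bmi")]
print(df[["age", "sex", "bmi", "bmi_nn"]])
```

Here the row with the missing bmi (age 71, sex 1) receives 26.3, the bmi of the other 71-year-old with sex 1, because that donor is at distance zero on the observed covariates.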
Upvotes: 3