k.ko3n

Reputation: 954

Implementation of sklearn.impute.IterativeImputer

Consider the data below, which contains some NaN values:

    Column-1  Column-2  Column-3  Column-4  Column-5
 0       NaN      15.0      63.0       8.0      40.0
 1      60.0      51.0       NaN      54.0      31.0
 2      15.0      17.0      55.0      80.0       NaN
 3      54.0      43.0      70.0      16.0      73.0
 4      94.0      31.0      94.0      29.0      53.0
 5      99.0      52.0      77.0      91.0      58.0
 6      84.0      19.0      36.0       NaN      97.0
 7      41.0      91.0      62.0      67.0      68.0
 8      44.0      38.0      27.0      53.0      37.0
 9      58.0       NaN      63.0      57.0      28.0
10      66.0      68.0      89.0      36.0      47.0
11       7.0      81.0       5.0      99.0      16.0
12      43.0      55.0      64.0      88.0       NaN
13       8.0      90.0      91.0      44.0       4.0
14      29.0      52.0      94.0      71.0      47.0
15      22.0      21.0      68.0      61.0      38.0
16      76.0      36.0      70.0      99.0      50.0
17      38.0      31.0      66.0      79.0      99.0
18      94.0      22.0      92.0      39.0      58.0

I want to replace the NaN values in the data using sklearn.impute.IterativeImputer. A friend helped me with the code below:

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # required to expose IterativeImputer
from sklearn.impute import IterativeImputer

imp = IterativeImputer(missing_values=np.nan, sample_posterior=False,
                       max_iter=10, tol=0.001,
                       n_nearest_features=4, initial_strategy='median')
imp.fit(data)
imputed_data = pd.DataFrame(data=imp.transform(data),
                            columns=['Column-1', 'Column-2', 'Column-3',
                                     'Column-4', 'Column-5'],
                            dtype='int')

The imputed_data is:

    Column-1  Column-2  Column-3  Column-4  Column-5
 0        59        15        63         8        40
 1        60        51        66        54        31
 2        15        17        55        80        48
 3        54        43        70        16        73
 4        94        31        94        29        53
 5        99        52        77        91        58
 6        84        19        36        59        97
 7        41        91        62        67        68
 8        44        38        27        53        37
 9        58        46        63        57        28
10        66        68        89        36        47
11         7        81         5        99        16
12        43        55        64        88        47
13         8        90        91        44         4
14        29        52        94        71        47
15        22        21        68        61        38
16        76        36        70        99        50
17        38        31        66        79        99
18        94        22        92        39        58

According to the IterativeImputer documentation, the default estimator is BayesianRidge(). But if I use another estimator, such as estimator=ExtraTreesRegressor(n_estimators=10, random_state=0) as in the code below, it raises a warning. The code:

from sklearn.ensemble import ExtraTreesRegressor

imp = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
                       missing_values=np.nan, sample_posterior=False,
                       max_iter=10, tol=0.001,
                       n_nearest_features=4, initial_strategy='median')
imp.fit(data)

The message:

C:\Users\...\sklearn\impute\_iterative.py:599: ConvergenceWarning: [IterativeImputer] Early stopping criterion not reached.
  " reached.", ConvergenceWarning)

My question: is this a correct approach, or should I do something to fix the warning message?
Thank you.

Upvotes: 8

Views: 11792

Answers (4)

Brayam Matias

Reputation: 1

If some missing entries are stored as blank strings rather than NaN, you can convert them first with the following line:

   missing_data = missing_data.replace(' ', np.nan)

To fill the missing values using the available features, we did the following:

from sklearn.experimental import enable_iterative_imputer
from sklearn.linear_model import LinearRegression
from sklearn.impute import IterativeImputer
import pandas as pd
import numpy as np

data = pd.read_csv("D:\\documents\\8thSemester\\MachineLearning\\1stPartial\\project1stPartial\\housing_in_london_yearly_variables_no_missing.csv")
missing_data_1 = pd.read_csv("D:\\documents\\8thSemester\\MachineLearning\\1stPartial\\project1stPartial\\housing_in_london_yearly_variables_normal.csv")
missing_data = missing_data_1[missing_data_1.isnull().any(axis=1)]
missing_data = missing_data.replace(' ', np.nan)

# Create an imputer that uses linear regression to estimate missing values
imputer = IterativeImputer(estimator=LinearRegression())

# Fit the imputer to your complete data
imputer.fit(data)

# Transform your data with missing values
missing_data_imputed = imputer.transform(missing_data)

# 'missing_data_imputed' is a numpy array, so you can convert it back to a DataFrame if needed
missing_data_imputed = pd.DataFrame(missing_data_imputed, columns=missing_data.columns)

# Save the DataFrame to a CSV file
missing_data_imputed.to_csv('missing_data_imputed.csv', index=False)
# Now, 'missing_data_imputed' is your original DataFrame, but with missing values filled in with the imputer's predictions

Upvotes: 0

akhil penta

Reputation: 1100

You are getting this warning because of the parameters max_iter=10 and tol=0.001 set for IterativeImputer().

The stopping criterion (max(abs(X_t - X_{t-1})) / abs(max(X[known_vals])) < tol) is not met within the 10 iterations allowed by max_iter=10.

Refer to the description of max_iter in the parameters section of the sklearn.impute.IterativeImputer documentation.

One workaround is to set max_iter to a higher value, as in the sketch below.
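
For illustration, here is a minimal sketch, assuming data is the DataFrame from the question; it raises max_iter and then checks the fitted imputer's n_iter_ attribute, which reports how many imputation rounds actually ran:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # required to expose IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

# Same setup as in the question, but with a larger iteration budget.
imp = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
                       missing_values=np.nan, sample_posterior=False,
                       max_iter=100, tol=0.001,
                       n_nearest_features=4, initial_strategy='median')
imp.fit(data)

# n_iter_ is the number of rounds that actually ran; if it equals max_iter,
# the tolerance was still not met and the warning will appear again.
print(imp.n_iter_)

Note that tree-based estimators such as ExtraTreesRegressor may never satisfy this tolerance-based criterion, so the warning can persist even with a much larger max_iter.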

Upvotes: 5

mel1

Reputation: 73

They are having the same issue here:

https://github.com/scikit-learn/scikit-learn/issues/14338

Upvotes: 3

Deepak Behera

Reputation: 1

Have you tried importing ExtraTreesRegressor first? It should work fine.

from sklearn.ensemble import ExtraTreesRegressor

Also check your scikit-learn version; it should be 0.21.1 or above.
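
For reference, a minimal sketch of the imports this setup needs, plus a version check; the enable_iterative_imputer import is required because IterativeImputer is still experimental:

import sklearn
print(sklearn.__version__)  # should be 0.21.1 or above, as noted

# The enabling import must come before importing IterativeImputer.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

imp = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10, random_state=0))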

Upvotes: -1
