Reputation: 23
I'm using the IterativeImputer (multiple imputation) from scikit-learn to impute missing values in rain datasets. Each column is a rain station and the index is a DateTime. I was able to run the IterativeImputer and get an output with all my missing values filled. The problem is that the output contains negative values. It's possible to change the min_value the imputer uses, but as far as I can tell that sets a single value for all columns. I want to set min_value per column, based on each column's minimum before the imputation. There is an answer here on Stack Overflow about that, but I have no clue how to apply it.
The code I'm using:
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
#Babitonga's region stations
babi_ana = pd.read_csv(all_csv_files[0]).set_index("Time") # Here I read the csv data
# Transforming my index to datetime
babi_ana.index = pd.to_datetime(babi_ana.index)
mask = (babi_ana.index > ini1) & (babi_ana.index <= fim1) #Selecting the date range
babi_ana1 = babi_ana.loc[mask]
# Applying the imputer
imputer_data = IterativeImputer(random_state = 0,skip_complete=True,sample_posterior=True, max_iter = 10, missing_values = np.nan)
data = babi_ana1
minimum = data.iloc[:,:].min(axis=0) #No negative values from the original
imputer_data.fit(data.iloc[:,:].values)
data_imputed = imputer_data.transform(data.iloc[:,:].values)
# Here I noticed that the output has negative values
data_imputed = pd.DataFrame(data_imputed)
minimum_after = data_imputed.iloc[:,:].min(axis=0) # several negative values, except for 2 stations
I want to be able to set min_value and max_value based on the max and min of each station before the imputation, like this:
max_imputer = data.iloc[:,:].max(axis = 0)
min_imputer = data.iloc[:,:].min(axis = 0)
Upvotes: 1
Views: 2215
Reputation: 366
Great improvements on the question :).
I've read a bit more about the IterativeImputer here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer.
The constructor takes a min_value parameter, which can be either a float or an array. If the same minimum applies to all features (columns) of your data, you can just use the float alternative.
For example, if you want the minimum possible value to be 0 in all features (columns), you could change your code to:
imputer_data = IterativeImputer(random_state = 0, skip_complete = True,sample_posterior = True, max_iter = 10, missing_values = np.nan, min_value = 0)
On the other hand, if you want different minimum values for different features, you need to pass an array with one entry per feature, in column order (array input is supported from scikit-learn 0.23 on). For example, if you have 2 features and the minimum values should be 0 and 5, respectively, you would change your code to:
imputer_data = IterativeImputer(random_state = 0, skip_complete = True,sample_posterior = True, max_iter = 10, missing_values = np.nan, min_value = [0, 5])
You can do the same for the max_value parameter.
The first change should make sure you don't get any more negative imputed values.
If you want to use min and max values based on the data you already have, the first step is to compute the minimum and maximum of each feature (column) before imputing. That is the same as getting the min and max of an array, once per column, and the two resulting arrays can then be passed as min_value and max_value.
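As a rough sketch of that step, the per-column bounds can be taken straight from the DataFrame and handed to the imputer. The station names and the randomly generated data below are made up for illustration and only stand in for the real rain DataFrame. The sketch also assumes a scikit-learn version that accepts arrays for min_value and max_value, and that no column is entirely missing.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical stand-in for the rain data: one column per station, non-negative values.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.gamma(shape=2.0, scale=5.0, size=(200, 3)),
    columns=["station_a", "station_b", "station_c"],
)
# Randomly replace roughly 20% of the values with NaN to simulate gaps.
data = data.mask(rng.random(data.shape) < 0.2)

# Per-column bounds from the observed values (pandas skips NaN by default).
min_imputer = data.min(axis=0).to_numpy()
max_imputer = data.max(axis=0).to_numpy()

imputer = IterativeImputer(
    random_state=0,
    sample_posterior=True,
    max_iter=10,
    missing_values=np.nan,
    min_value=min_imputer,  # one lower bound per column, in column order
    max_value=max_imputer,  # one upper bound per column, in column order
)

# Keep the original index and column names on the imputed result.
data_imputed = pd.DataFrame(
    imputer.fit_transform(data.to_numpy()),
    index=data.index,
    columns=data.columns,
)

print(data_imputed.min(axis=0))
```

With the bounds in place, the per-column minimum after imputation should not fall below the per-column minimum before it.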
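As a final note, negative output after fitting on only non-negative data looks odd at first, but it can happen: the default estimator is a linear model (BayesianRidge), and sample_posterior=True draws imputed values from a Gaussian predictive distribution, so neither is bounded at zero unless min_value is set. I'd still double check that data.iloc[:,:].values really is the data you want, in the format the Imputer is expecting.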
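One way to do that double check, as a minimal sketch: the small DataFrame below is invented for illustration and stands in for the real data. It only confirms that every column is numeric, that no observed value is negative, and how many values are missing per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the rain DataFrame.
data = pd.DataFrame(
    {
        "station_a": [0.0, 1.2, np.nan, 3.4],
        "station_b": [0.5, np.nan, 2.0, 0.0],
    }
)

# The array that would be passed to the imputer.
values = data.to_numpy()

# Every column should be numeric; object columns usually mean parsing problems.
is_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in data.dtypes)

# Comparisons with NaN are False, so only observed values are tested here.
has_negative_observed = bool((data < 0).any().any())

# Number of missing values per column.
missing_per_column = data.isna().sum()

print(values.shape, is_numeric, has_negative_observed)
print(missing_per_column)
```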
Upvotes: 2