Reputation: 793
I have five separate pandas dataframes that I've put inside a dictionary. I want to run five separate IsolationForest models in scikit-learn, with a different set of parameters for each model. However, I don't want to run each model separately.
So my question is: how can I run these models and get the decision functions and predictions for all dataframes in one go? My attempt at doing so is below.
# parameters for each of the five models in a list. The index position in the list
# represents a dataset's parameters, from dataset0 through dataset4
n_estimators = [150, 200, 125, 125, 125]
max_samples = [0.70, 0.70, 0.80, 0.70, 0.70]
max_features = [1, 4, 2, 2, 3]
contamination = [0.05, 0.06, 0.05, 0.07, 0.05]
# numeric columns
num_columns = list(subset_features[1:])
# column transformer
num_transformer = Pipeline([('impute', IterativeImputer()), ('scale', StandardScaler())])
ct = ColumnTransformer([('num_pipeline', num_transformer, num_columns)])
clf = Pipeline([('ct', ct),
                ('iforest', IsolationForest(n_estimators=n_estimators[i],
                                            max_samples=max_samples[i],
                                            max_features=max_features[i],
                                            contamination=contamination[i],
                                            n_jobs=4,
                                            random_state=None))])
clf_res = {}
for i, df in enumerate(dfs.values()):
    print('starting idx:', i)
    clf_res[i] = clf.fit(df)
The issue I have is that the pipeline does not pick up a different set of parameters as the dataframes change from iteration to iteration; every fitted model ends up with the same parameters. See below:
{0: Pipeline(steps=[('ct',
ColumnTransformer(transformers=[('num_pipeline',
Pipeline(steps=[('impute',
IterativeImputer()),
('scale',
StandardScaler())]),
['V1', 'V2', 'V3',
'V4'])])),
('iforest',
IsolationForest(contamination=0.05, max_features=1,
max_samples=0.7, n_estimators=125,
n_jobs=4))]),
1: Pipeline(steps=[('ct',
ColumnTransformer(transformers=[('num_pipeline',
Pipeline(steps=[('impute',
IterativeImputer()),
('scale',
StandardScaler())]),
['V1', 'V2', 'V3',
'V4'])])),
('iforest',
IsolationForest(contamination=0.05, max_features=1,
max_samples=0.7, n_estimators=125,
n_jobs=4))])
So what I want is for the parameters to change as the dataframes change.
Upvotes: 0
Views: 357
Reputation: 793
After careful review of my code, and a little bit of googling, I found out what was wrong. I am sharing it here for anyone else who may run into a similar problem.
The fix was to build the clf pipeline inside the for loop, so that each iteration constructs an IsolationForest with that dataframe's parameters. I also moved the transformer construction into the loop: Pipeline.fit does not clone its steps, so a single shared ColumnTransformer would be refit on every iteration and all the stored pipelines would end up sharing the transformer fitted on the last dataframe.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import IsolationForest

# parameters for each of the five models in a list. The index position in the list
# represents a dataset's parameters, from dataset0 through dataset4
n_estimators = [150, 200, 125, 125, 125]
max_samples = [0.70, 0.70, 0.80, 0.70, 0.70]
max_features = [1, 4, 2, 2, 3]
contamination = [0.05, 0.06, 0.05, 0.07, 0.05]
# numeric columns
num_columns = list(subset_features[1:])
clf_res = {}
for i, df in enumerate(dfs.values()):
    print('starting idx:', i)
    # build a fresh transformer and pipeline for each dataframe
    num_transformer = Pipeline([('impute', IterativeImputer()),
                                ('scale', StandardScaler())])
    ct = ColumnTransformer([('num_pipeline', num_transformer, num_columns)])
    clf = Pipeline([('ct', ct),
                    ('iforest', IsolationForest(n_estimators=n_estimators[i],
                                                max_samples=max_samples[i],
                                                max_features=max_features[i],
                                                contamination=contamination[i],
                                                n_jobs=4,
                                                random_state=None))])
    clf_res[i] = clf.fit(df)
A sample of the correct output is below:
{0: Pipeline(steps=[('ct',
ColumnTransformer(transformers=[('num_pipeline',
Pipeline(steps=[('impute',
IterativeImputer()),
('scale',
StandardScaler())]),
['v1', 'v2', 'v3',
'v4'])])),
('iforest',
IsolationForest(contamination=0.05, max_features=1,
max_samples=0.70,
n_estimators=150, n_jobs=4))]),
1: Pipeline(steps=[('ct',
ColumnTransformer(transformers=[('num_pipeline',
Pipeline(steps=[('impute',
IterativeImputer()),
('scale',
StandardScaler())]),
['v1', 'v2', 'v3',
'v4'])])),
('iforest',
IsolationForest(contamination=0.05, max_features=4,
max_samples=0.7, n_estimators=200,
n_jobs=4))]),
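To actually get the decision functions and predictions the question asks about, the fitted pipelines can be queried in a second pass with decision_function and predict. Below is a self-contained sketch of that idea with two small synthetic dataframes; the dataframe dictionary, column names, and parameter values here are made up for illustration and stand in for the real ones above.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import IsolationForest

# two toy dataframes standing in for the real dictionary of five
rng = np.random.default_rng(0)
num_columns = ['V1', 'V2']
dfs = {f'dataset{i}': pd.DataFrame(rng.normal(size=(100, 2)),
                                   columns=num_columns)
       for i in range(2)}

# per-dataset parameters, indexed the same way as the dict order
n_estimators = [150, 200]
contamination = [0.05, 0.06]

results = {}
for i, (name, df) in enumerate(dfs.items()):
    # fresh transformer and pipeline per dataframe
    num_transformer = Pipeline([('impute', IterativeImputer()),
                                ('scale', StandardScaler())])
    ct = ColumnTransformer([('num_pipeline', num_transformer, num_columns)])
    clf = Pipeline([('ct', ct),
                    ('iforest', IsolationForest(n_estimators=n_estimators[i],
                                                contamination=contamination[i],
                                                random_state=0))])
    clf.fit(df)
    # anomaly scores (higher = more normal) and -1/1 labels per row
    results[name] = {'scores': clf.decision_function(df),
                     'preds': clf.predict(df)}
```

The same calls work on the pipelines stored in clf_res above, e.g. clf_res[0].decision_function(df) and clf_res[0].predict(df), since Pipeline forwards both methods to the final IsolationForest step after applying the transformer.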
Upvotes: 0