Reputation: 1257
I am working on a project that calls for a lean Python AutoML pipeline implementation. As per the project definition, data entering the pipeline comes in the form of serialised business objects, e.g. (artificial example):
property.json:
{
    "area": "124",
    "swimming_pool": "False",
    "rooms": [
        ... some information on individual rooms ...
    ]
}
Machine learning targets (e.g. predicting whether a property has a swimming pool based on other attributes) are stored within the business object rather than delivered in a separate label vector, and business objects may contain observations which should not be used for training.
I need a pipeline engine which supports initial (or later) pipeline steps that i) dynamically change the targets in the machine learning problem (e.g. extract from input data, threshold real values) and ii) resample input data (e.g. upsampling, downsampling of classes, filtering observations).
The pipeline ideally should look as follows (pseudocode):
swimming_pool_pipeline = Pipeline([
    ("label_extractor", SwimmingPoolExtractor()),  # skipped in prediction mode
    ("sampler", DataSampler()),  # skipped in prediction mode
    ("featurizer", SomeFeaturization()),
    ("my_model", FitSomeModel())
])
swimming_pool_pipeline.fit(training_data) # not passing in any labels
preds = swimming_pool_pipeline.predict(test_data)
The pipeline execution engine needs to fulfill/allow for the following:
When calling .fit():

i) SwimmingPoolExtractor extracts target labels from the input training data and passes the labels on (alongside the independent variables);
ii) DataSampler() uses the target labels extracted in the previous step to sample observations (e.g. it could do minority upsampling or filter observations).

When calling .predict():

i) SwimmingPoolExtractor() does nothing and just passes on the input data;
ii) DataSampler() does nothing and just passes on the input data.

For example, assume that the data looks as follows:
property.json:
"properties" = [
{ "id_": "1",
"swimming_pool": "False",
...,
},
{ "id_": "2",
"swimming_pool": "True",
...,
},
{ "id_": "3",
# swimming_pool key missing
...,
}
]
The application of SwimmingPoolExtractor() would extract something like:

"labels": [
    {"id_": "1", "label": "0"},
    {"id_": "2", "label": "1"},
    {"id_": "3", "label": "-1"}
]

from the input data and set these as the machine learning pipeline's "targets".
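In plain Python, that extraction could look like the following sketch (extract_labels is just a hypothetical helper to illustrate the mapping, not part of any library):

def extract_labels(properties):
    # Map the optional "swimming_pool" key to a label: "1"/"0" when present,
    # "-1" when the key is missing (observation not usable for training).
    labels = []
    for prop in properties:
        if "swimming_pool" not in prop:
            label = "-1"
        elif prop["swimming_pool"] == "True":
            label = "1"
        else:
            label = "0"
        labels.append({"id_": prop["id_"], "label": label})
    return labels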
The application of DataSampler() could, for example, further include logic that removes any training instance which did not contain a swimming_pool key (label = -1) from the entire set of training data.
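The filtering itself could be as simple as this sketch (again a hypothetical helper, assuming data and labels stay aligned by id_):

def filter_missing_targets(properties, labels):
    # Keep only observations whose extracted label is not "-1".
    kept_ids = {lbl["id_"] for lbl in labels if lbl["label"] != "-1"}
    kept_properties = [p for p in properties if p["id_"] in kept_ids]
    kept_labels = [lbl for lbl in labels if lbl["label"] != "-1"]
    return kept_properties, kept_labels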
Subsequent steps should then use the modified training data (filtered, no longer including the observation with id_=3) to fit the model. As stated above, in prediction mode, DataSampler and SwimmingPoolExtractor would just pass the input data through.
To my knowledge, neither neuraxle nor sklearn (for the latter I am sure) offers pipeline steps that meet the required functionality (from what I have gathered so far, neuraxle must at least have support for slicing data, given that it implements cross-validation meta-estimators).

Am I missing something, or is there a way to implement such functionality in either of the pipeline models? If not, are there alternatives to the listed libraries within the Python ecosystem that are reasonably mature and support such use cases (leaving aside issues that might arise from designing pipelines in such a manner)?
Upvotes: 2
Views: 190
Reputation: 10948
"Am I missing something, or is there a way to implement such functionality"
Yes. With Neuraxle, a pipeline step can transform part of its input x into y within the pipeline (thus effectively not passing in any labels to fit, as you want to do), and steps can be wrapped so that they only run at train time.

This holds provided that your input data passed to fit is an iterable of something (e.g. don't pass the whole JSON at once; at least make something that can be iterated on). At worst, pass a list of IDs and add a step that will convert the IDs to something else, using an object that can go fetch the JSON by itself and do whatever it needs with the passed IDs, for instance.
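For instance, pre-splitting the JSON from your example into an iterable could be as trivial as (a sketch, assuming the "properties" layout from your question):

import json

with open("property.json") as f:
    training_data = json.load(f)["properties"]  # now a plain list of dicts, i.e. iterable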
from neuraxle.base import BaseStep, NonFittableMixin
from neuraxle.pipeline import Pipeline
from neuraxle.steps.flow import TrainOnlyWrapper
from neuraxle.steps.output_handlers import InputAndOutputTransformerMixin

# Note: you may need to drop NonFittableMixin from the bases below if you
# encounter problems, and define "fit" yourself rather than having the mixin
# provide it by default.
class SwimmingPoolExtractor(NonFittableMixin, InputAndOutputTransformerMixin, BaseStep):
    def transform(self, data_inputs):
        # Here, the InputAndOutputTransformerMixin passes
        # a tuple of (x, y) rather than just x.
        x, _ = data_inputs

        # Please note that you should pre-split your json into
        # lists before the pipeline so as to have this assert pass:
        assert hasattr(x, "__iter__"), "input data must be iterable at least."
        x, y = self._do_my_extraction(x)  # TODO: implement this as you wish!

        # Note that InputAndOutputTransformerMixin expects you
        # to return a (x, y) tuple, not only x.
        outputs = (x, y)
        return outputs

class DataSampler(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        data_inputs = self._do_my_sampling(data_inputs)  # TODO: implement this as you wish!

        assert hasattr(data_inputs, "__iter__"), "data must stay iterable at least."
        return data_inputs
swimming_pool_pipeline = Pipeline([
    TrainOnlyWrapper(SwimmingPoolExtractor()),  # skipped in `.predict(...)` call
    TrainOnlyWrapper(DataSampler()),  # skipped in `.predict(...)` call
    SomeFeaturization(),
    FitSomeModel()
])

swimming_pool_pipeline.fit(training_data)  # not passing in any labels!
preds = swimming_pool_pipeline.predict(test_data)
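As a side note, TrainOnlyWrapper decides whether to run its wrapped step from the pipeline's train flag, which fit and predict manage for you; if I remember the Neuraxle API correctly, you can also toggle it by hand:

swimming_pool_pipeline.set_train(False)  # wrapped steps now just pass data through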
You can also replace the plain call to fit above with Neuraxle's AutoML loop to get hyperparameter tuning on top (imports as in Neuraxle's own examples; tmpdir being some cache directory of yours):

from sklearn.metrics import mean_squared_error
from neuraxle.metaopt.auto_ml import AutoML, InMemoryHyperparamsRepository, ValidationSplitter
from neuraxle.metaopt.callbacks import ScoringCallback

auto_ml = AutoML(
    swimming_pool_pipeline,
    # You can create your own splitter class if needed to replace this one. Dig
    # into the source code of Neuraxle and see how it's done to create your own
    # replacement.
    validation_splitter=ValidationSplitter(0.20),
    refit_trial=True,
    n_trials=10,
    epochs=1,
    cache_folder_when_no_handle=str(tmpdir),
    scoring_callback=ScoringCallback(mean_squared_error, higher_score_is_better=False),  # mean_squared_error from sklearn
    hyperparams_repository=InMemoryHyperparamsRepository(cache_folder=str(tmpdir))
)
best_swimming_pool_pipeline = auto_ml.fit(training_data).get_best_model()
preds = best_swimming_pool_pipeline.predict(test_data)
If you want to use caching, you should not define any transform methods, and should instead define handle_transform methods (or related handler methods), so as to keep the order of the data IDs when you resample the data. Neuraxle is made to process iterable data, and this is why I've done some asserts above, so as to ensure your JSON is already preprocessed such that it is some kind of list of something.
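As a rough illustration only (the handler methods and the DataContainer API have changed between Neuraxle versions, so treat every name below as an assumption to check against your installed version), a handler-based sampler that keeps the IDs aligned with the resampled data could look roughly like this:

from neuraxle.base import BaseStep, ExecutionContext
from neuraxle.data_container import DataContainer

class HandlerBasedDataSampler(BaseStep):  # hypothetical, not from the code above
    def _transform_data_container(self, data_container: DataContainer, context: ExecutionContext) -> DataContainer:
        # Filter ids, inputs, and expected outputs together so they stay aligned:
        kept = [
            (id_, x, y)
            for id_, x, y in zip(data_container.current_ids,
                                 data_container.data_inputs,
                                 data_container.expected_outputs)
            if y != "-1"  # e.g. drop observations whose target was missing
        ]
        ids, xs, ys = zip(*kept) if kept else ((), (), ())
        return DataContainer(
            data_inputs=list(xs), current_ids=list(ids), expected_outputs=list(ys))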
Upvotes: 1