Reputation: 18537
I use pandas to do feature extraction for machine learning. I hope to achieve the following: suppose I have five data processing steps that run sequentially, and I execute them once; the results are saved automatically. Next time, if I change the fourth step, the library picks up the saved output of the third step and re-runs only the fourth and fifth steps.
Is this kind of caching supported natively by pandas, by sklearn.pipeline.Pipeline, or by other data processing libraries, without me having to save the intermediate results explicitly?
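For concreteness, here is a sketch of the workflow I mean; the five step names and their contents are made up. The goal is that, after I edit step4_scale, a library would reload the saved output of step3_total and re-run only steps four and five:

import pandas as pd

# Five hypothetical processing steps, each taking and returning a DataFrame.
def step1_clean(df):
    return df.dropna()

def step2_ratio(df):
    return df.assign(ratio=df["a"] / df["b"])

def step3_total(df):
    return df.assign(total=df["a"] + df["b"])

def step4_scale(df):
    return (df - df.mean()) / df.std()

def step5_select(df):
    return df[["ratio", "total"]]

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Editing step4_scale should ideally reuse the cached result of step3_total
# instead of recomputing steps 1 through 3.
features = step5_select(step4_scale(step3_total(step2_ratio(step1_clean(df)))))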
Upvotes: 1
Views: 445
Reputation: 1
VevestaX (https://github.com/Vevesta/VevestaX) can be used to track the features and parameters used in a machine learning experiment. It can be installed with
pip install vevestaX
It has simple commands for tracking the features used. For example:
V.dataSourcing = df
In a Jupyter notebook, this command needs to be run only once, and it will capture the input features. To capture feature engineering, run
V.featureEngineering = df
or
V.fe = df
Lastly, variables can be captured by writing them between V.start() and V.end():
V.start()
epochs = 10
V.end()
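For completeness, here is a minimal end-to-end sketch. The import path, the Experiment class, and the dump call follow the project README at the time of writing and should be treated as assumptions; the input file and column names are hypothetical.

import pandas as pd
from vevestaX import vevesta as v  # import pattern from the VevestaX README

V = v.Experiment()

df = pd.read_csv("train.csv")  # hypothetical input file
V.dataSourcing = df            # track the raw input features

df["price_per_unit"] = df["price"] / df["units"]  # hypothetical engineered feature
V.featureEngineering = df      # track the engineered features

V.start()    # variables assigned between start() and end()
epochs = 10  # are captured as experiment parameters
V.end()

V.dump(techniqueUsed="XGBoost")  # assumed README call that writes the tracked experiment to an Excel file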
Upvotes: 0
Reputation: 1360
MLflow Tracking has some nice features that seem to be lacking in Dagster (a record of the current git commit, ML metrics, etc.). MLflow also integrates nicely with Databricks, which allows for easy cluster deployment. However, it is really lacking when it comes to building complicated pipelines, which is where Dagster excels.
Is there a way to get "the best of all worlds", that is, to integrate Dagster with MLflow and thereby have it run on Databricks?
Or is there a good alternative?
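For reference, here is a minimal sketch of one direction this could take: logging to MLflow Tracking from inside a Dagster op. The op/job usage is standard Dagster; the tracking URI and logged values are placeholders, and pointing at Databricks assumes workspace credentials are already configured.

import mlflow
from dagster import job, op

@op
def train_model():
    # "databricks" makes MLflow log to the workspace configured via the
    # Databricks CLI; use a local URI when testing.
    mlflow.set_tracking_uri("databricks")
    with mlflow.start_run():
        mlflow.log_param("epochs", 10)       # placeholder hyperparameter
        mlflow.log_metric("accuracy", 0.93)  # placeholder metric

@job
def training_job():
    train_model()

if __name__ == "__main__":
    training_job.execute_in_process()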
Upvotes: 0