william007

Reputation: 18537

Are pipelines capable of caching intermediate results?

I use pandas to do feature extraction for machine learning. I hope to achieve the following: suppose I have five data processing steps that run sequentially, and I execute them once. The results are saved automatically. Next time, if I change the fourth step, the library reuses the cached results of the first three steps and recomputes only the fourth and fifth.

Is this kind of caching supported natively by pandas, sklearn.pipeline.Pipeline, or other data processing libraries, without the need to save the intermediate results explicitly?
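
To make the desired behaviour concrete: sklearn's Pipeline accepts a memory argument that caches fitted transformers on disk via joblib, so unchanged upstream steps are not refitted when only a later step changes. A minimal sketch (the step names, estimators, and cache directory below are illustrative):

# Pipeline caches fitted transformers when a `memory` location is given,
# so refitting after changing a later step reuses the cached results of
# the earlier, unchanged steps.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("reduce", PCA(n_components=10)),
        ("model", LogisticRegression()),
    ],
    memory="pipeline_cache",  # directory used by joblib to store the cache
)
pipe.fit(X, y)

# Changing only the final step and refitting loads the cached
# StandardScaler and PCA results instead of recomputing them.
pipe.set_params(model__C=0.5)
pipe.fit(X, y)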

Upvotes: 1

Views: 445

Answers (2)

user17561811

Reputation: 1

VevestaX (https://github.com/Vevesta/VevestaX) can be used to track the features and parameters used in a machine learning experiment. It can be installed with

pip install vevestaX

It has simple commands to track the features used. For example:

V.dataSourcing = df

In a Jupyter notebook, this command only needs to be run once, and it will capture the features. To capture feature engineering, run

V.featureEngineering = df

or

V.fe = df

Finally, variables can be captured by placing the code between V.start() and V.end():

V.start()
epochs = 10
V.end()
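
Putting the pieces together, a runnable sketch based on the project's README (the Experiment constructor, the V.fe alias, and the dump call follow that documentation and may differ across versions; the CSV file and engineered column are hypothetical):

import pandas as pd
from vevestaX import vevesta as v

# Create an experiment object that records features and variables.
V = v.Experiment()

df = pd.read_csv("data.csv")  # hypothetical input file
V.dataSourcing = df  # track the features read from the source

df["x_squared"] = df["x"] ** 2  # hypothetical engineered feature
V.fe = df  # track engineered features (alias for V.featureEngineering)

# Variables assigned between start() and end() are captured.
V.start()
epochs = 10
V.end()

# Write the tracked features, variables, and parameters to an Excel log.
V.dump(techniqueUsed="demo")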

Upvotes: 0

moomima

Reputation: 1360

MLflow Tracking has some nice features that seem to be lacking in Dagster (a record of the current git commit, ML metrics, etc.). It also integrates nicely with Databricks, which allows for easy cluster deployment. However, it falls short at building complicated pipelines, which is where Dagster excels.
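
For context, the kind of tracking meant here is MLflow's run logging; a minimal sketch (the run name, parameter, and metric values are illustrative):

import mlflow

# Log a parameter and a metric to an MLflow run. When executed from
# inside a git repository, MLflow also records the current commit as a
# run tag automatically.
with mlflow.start_run(run_name="example"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("rmse", 0.23)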

Is there a way to get "the best of all worlds", that is, to integrate Dagster with MLflow and thus have it run on Databricks?

Or is there a good alternative?

Upvotes: 0
