Reputation: 5282
I am reading about Kubeflow, and there are two ways to create components.
But there isn't an explanation of why I should use one or the other. For example, to load a container-based component, I need to build and push a Docker image and then load the YAML with the specification into the pipeline, but with a function-based component, I only need to import the function.
And in order to apply CI/CD with the latest version: if I have container-based components, I can keep a repo with all the YAML files and load them with load_component_from_url, but if they are functions, I can keep a repo with all of them and load them as a package too.
So which do you think is the best approach: container-based or function-based?
Thanks.
Upvotes: 2
Views: 1356
Reputation: 6812
But there isn't an explanation of why I should use one or the other. For example, to load a container-based component, I need to build and push a Docker image and then load the YAML with the specification into the pipeline, but with a function-based component, I only need to import the function.
There are some misconceptions here.
There is only one kind of component under the hood: the container-based component (there are also graph components, but they are irrelevant here).
However, most of our users like Python and do not like building containers. This is why I've developed a feature called "Lightweight Python components", which generates the ComponentSpec/component.yaml from a Python function's source code. The generated component basically runs python3 -u -c '<your function>; <command-line parsing>' arg1 arg2 ...
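For illustration, a minimal sketch of what that looks like (the add function here is just an example):

```python
from kfp.components import create_component_from_func

def add(a: float, b: float) -> float:
    """A self-contained Python function; KFP generates the component spec from its source."""
    return a + b

# Produces a container-based component that runs the function's source code
# inside a default Python base image via `python3 -u -c ...`.
add_op = create_component_from_func(add)
```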
There is a misconception that "function-based components are different from component.yaml files". No, it's the same format. You're supposed to save the generated component into a file for sharing: create_component_from_func(my_func, output_component_file='component.yaml'). After your code stabilizes, you should upload the code and the component.yaml to GitHub or another place and use load_component_from_url to load that component.yaml in pipelines.
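A sketch of that workflow, reusing the add function from above (the GitHub URL is a placeholder):

```python
from kfp.components import create_component_from_func, load_component_from_url

# 1) Generate the component and save the spec next to the code.
add_op = create_component_from_func(add, output_component_file='component.yaml')

# 2) After publishing component.py and component.yaml to GitHub,
#    pipelines load the stable component by URL.
add_op = load_component_from_url(
    'https://raw.githubusercontent.com/my-org/my-components/master/add/component.yaml'
)
```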
Check the component.yaml files in the KFP repo. More than half of the component.yaml files are Lightweight components - they're generated from Python functions.
component.yaml files are intended for sharing components. They're declarative, portable, indexable, safe, language-agnostic, etc. You should always publish component.yaml files. If a component.yaml is generated from a Python function, then it's good practice to put component.py alongside it so that the component can be easily regenerated when making changes.
The decision whether to create a component using the Lightweight Python component feature or not is very simple:
Is your code in a self-contained Python function (not a CLI program yet)? Do you want to avoid building, pushing, and maintaining containers? If yes, then the Lightweight Python component feature (create_component_from_func) can help you and generate the component.yaml for you.
Otherwise, write component.yaml yourself.
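For the "write it yourself" path, here is a hedged sketch of a hand-written spec; it's embedded via load_component_from_text so the example stays self-contained, and the image, command, and names are all hypothetical:

```python
from kfp.components import load_component_from_text

# A hand-written component spec for an existing CLI program.
# In practice this YAML would live in its own component.yaml file.
train_op = load_component_from_text('''
name: Train model
inputs:
- {name: training_data, type: String}
outputs:
- {name: model}
implementation:
  container:
    image: gcr.io/my-org/trainer:1.0
    command: [python3, /app/train.py]
    args:
    - --data
    - {inputValue: training_data}
    - --model-output
    - {outputPath: model}
''')
```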
Upvotes: 3
Reputation: 449
The short answer is: it depends. But a more nuanced answer depends on what you want to do with the component.
As background knowledge: when a KFP pipeline is compiled, it's actually a series of different YAMLs that are launched by Argo Workflows. All of these need to be container-based to run on Kubernetes, even if the container itself only runs Python.
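You can see this yourself by compiling a trivial pipeline and inspecting the output; a sketch (the names and image are arbitrary):

```python
import kfp
from kfp import dsl

@dsl.pipeline(name='demo-pipeline')
def demo_pipeline():
    # Every step, Python or not, ends up as a container in the Argo spec.
    dsl.ContainerOp(
        name='echo',
        image='alpine:3.14',
        command=['sh', '-c', 'echo hello'],
    )

# Writes the Argo Workflow YAML that KFP submits to the cluster.
kfp.compiler.Compiler().compile(demo_pipeline, 'demo_pipeline.yaml')
```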
Converting a function to a Python container op is a quick way to get started with Kubeflow Pipelines. It was designed to model after Airflow's Python-native DSL. It takes your Python function and runs it within a defined Python container. You're right that it's easier to encapsulate all your work within the same Git folder. This setup is great for teams that are just getting started with KFP and don't mind some boilerplate to get going quickly.
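A quick-start sketch of that pattern, using KFP's func_to_container_op (the function and defaults here are illustrative):

```python
from kfp import dsl
from kfp.components import func_to_container_op

@func_to_container_op  # also accepts options such as base_image='python:3.8'
def multiply(a: float, b: float) -> float:
    """Runs inside a stock Python container when the pipeline executes."""
    return a * b

@dsl.pipeline(name='quickstart')
def quickstart_pipeline(a: float = 2.0, b: float = 3.0):
    multiply(a, b)
```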
Components really become powerful when your team needs to share work, or when you have an enterprise ML platform that creates template logic for how to run specific jobs in a pipeline. Components can be separately versioned and built to be used on any of your clusters in the same way (the underlying container should be stored in Docker Hub or ECR, if you're on AWS). There are inputs/outputs that prescribe how a run will execute using the component. You can imagine a team at Uber might use a KFP component to pull data on the number of drivers in a certain zone. The inputs to the component could be a geo-coordinate box and the time of day for which to load the data. The component saves the data to S3, which is then loaded by your model for training. Without the component, there would be quite a bit of boilerplate code that would need to be copied across multiple pipelines and users.
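As a hedged sketch of that pattern (the URL, component, and input names are all hypothetical):

```python
from kfp import dsl
from kfp.components import load_component_from_url

# A shared, versioned component published by a platform team; pinning a
# release tag in the URL keeps behavior identical across clusters.
load_driver_data_op = load_component_from_url(
    'https://raw.githubusercontent.com/my-org/ml-components/v2.1/load_driver_data/component.yaml'
)

@dsl.pipeline(name='driver-training')
def driver_training_pipeline(geo_box: str, time_of_day: str):
    # The component's declared inputs prescribe exactly how it is invoked;
    # it writes its data to S3 for the downstream training step.
    load_driver_data_op(geo_box=geo_box, time_of_day=time_of_day)
```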
I'm a former PM at AWS for SageMaker and open-source ML integrations, and I'm sharing from my experience looking at enterprise setups.
Upvotes: 4