Sagar SN

Reputation: 35

How to connect kubeflow pipeline components

I want to establish a connection between the pipeline components by passing any kind of data, just so the pipeline is organized like a flowchart with arrows. Right now it looks like the screenshot below.

[screenshot of the current pipeline graph]

Irrespective of whether the Docker container generates output or not, I would like to pass some sample data between the components. If any changes are required in the Docker container code or the .yaml files, please let me know.

KFP Code

import kfp

# Load the components
component1 = kfp.components.load_component_from_file('comp_typed.yaml')
component2 = kfp.components.load_component_from_file('component2.yaml')
component3 = kfp.components.load_component_from_file('component3.yaml')
component4 = kfp.components.load_component_from_file('component4.yaml')

# Use the components as part of the pipeline
@kfp.dsl.pipeline(name='Document Processing Pipeline', description='Document Processing Pipeline')
def data_passing():
    task1 = component1()
    task2 = component2(task1.output)
    task3 = component3(task2.output)
    task4 = component4(task3.output)

comp_typed.yaml code

name: DPC
description: This is an example
implementation:
  container:
    image: gcr.io/pro1-in-us/dpc_comp1@sha256:3768383b9cd694936ef00464cb1bdc7f48bc4e9bbf08bde50ac7346f25be15de
    command: [python3, /dpc_comp1.py]

component2.yaml

name: Custom_Plugin_1
description: This is an example
implementation:
  container:
    image: gcr.io/pro1-in-us/plugin1@sha256:16cb4aa9edf59bdf138177d41d46fcb493f84ce798781125dc7777ff5e1602e3
    command: [python3, /plugin1.py]

I tried this and this but could not achieve anything except errors. I am new to Python and Kubeflow. What code changes should I make to pass data between all four components using the KFP SDK? The data can be a file or a string.

Let's suppose component 1 downloads a .pdf file from a GCS bucket. Can I feed the same file into the next downstream component? Component 1 downloads the file to the '/tmp/doc_pages' location inside its own Docker container, which I believe is local to that particular container, so the downstream components cannot read it?

Upvotes: 4

Views: 4049

Answers (3)

katakuri

Reputation: 397

If you don't want to create a dependency through outputs or pass any data between components, you can refer to the PVC (PersistentVolumeClaim) from the previous step to explicitly call out a dependency.

Example: You can create a PVC for storing data.

vop = dsl.VolumeOp(name="pvc",
                   resource_name="pvc", size=<size>, 
                   modes=dsl.VOLUME_MODE_RWO,)

Use it in a component:

download = dsl.ContainerOp(name="download",image="", 
                           command=[" "], arguments=[" "], 
                           pvolumes={"/data": vop.volume},)

Now you can call out the dependency between download and train as follows:

train = dsl.ContainerOp(name="train",image="", 
                        command=[" "], arguments=[" "], 
                        pvolumes={"/data": download.pvolumes["/data"]},)

Upvotes: 0

Ark-kun

Reputation: 6787

In addition to Amy's excellent answer:

Your pipeline is correct. The best way to establish a dependency between components is to establish data dependency.

Let's look at your pipeline code:

task2 = component2(task1.output)

You're passing the output of task1 to component2. This should result in the dependency that you want. But there are a couple of problems (and your pipeline will show compilation errors if you try to compile it):

  1. component1 needs to have an output
  2. component2 needs to have an input
  3. component2 needs to have an output (so that you can pass it to component3)

Etc.

Let's add them:

comp_typed.yaml:

name: DPC
description: This is an example
outputs:
- name: output_1
implementation:
  container:
    image: gcr.io/pro1-in-us/dpc_comp1@sha256:3768383b9cd694936ef00464cb1bdc7f48bc4e9bbf08bde50ac7346f25be15de
    command: [python3, /dpc_comp1.py, --output-1-path, {outputPath: output_1}]

component2.yaml:

name: Custom_Plugin_1
description: This is an example
inputs:
- name: input_1
outputs:
- name: output_1
implementation:
  container:
    image: gcr.io/pro1-in-us/plugin1@sha256:16cb4aa9edf59bdf138177d41d46fcb493f84ce798781125dc7777ff5e1602e3
    command: [python3, /plugin1.py, --input-1-path, {inputPath: input_1}, --output-1-path, {outputPath: output_1}]

With these changes, your pipeline should compile and display the dependencies that you want.
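For the container side, here is a minimal sketch of what /dpc_comp1.py might look like so that it honors the --output-1-path flag from the command above (the file contents are purely illustrative):

# dpc_comp1.py - sketch of a program matching the command line above.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--output-1-path', type=str, required=True)
args = parser.parse_args()

# KFP chooses the output path and passes it in; the program writes there.
os.makedirs(os.path.dirname(args.output_1_path), exist_ok=True)
with open(args.output_1_path, 'w') as f:
    f.write('sample data for the downstream component')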

Please check the tutorial about creating components from command-line programs.

Upvotes: 1

Amy U.

Reputation: 2237

This notebook, which describes how to pass data between KFP components, may be useful. It covers the concept of 'small data', which is passed directly as a value, versus 'large data', which you write to a file; as shown in the example notebook, the paths for the input and output files are chosen by the system and are passed into your function as strings.
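As an illustration, a sketch using lightweight Python components (the function names here are made up): 'small data' is returned directly as a value, while 'large data' goes through system-chosen file paths declared with InputPath/OutputPath:

import kfp
from kfp.components import create_component_from_func, InputPath, OutputPath

def produce_doc(doc_path: OutputPath(str)):
    # 'Large data': write to a file path chosen by the system.
    with open(doc_path, 'w') as f:
        f.write('page 1\npage 2\n')

def count_lines(doc_path: InputPath(str)) -> int:
    # 'Small data': the line count is returned directly as a value.
    with open(doc_path) as f:
        return len(f.readlines())

produce_doc_op = create_component_from_func(produce_doc)
count_lines_op = create_component_from_func(count_lines)

@kfp.dsl.pipeline(name='data-passing-example')
def demo_pipeline():
    produce_task = produce_doc_op()
    # The '_path' suffix is stripped, so the output port is named 'doc'.
    count_task = count_lines_op(produce_task.outputs['doc'])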

If you don't need to pass data between steps, but want to specify a step ordering dependency (e.g. op2 doesn't run until op1 is finished) you can indicate this in your pipeline definition like so:

op2.after(op1)
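In context, that call sits inside the pipeline function; a sketch, assuming components with no required inputs:

@kfp.dsl.pipeline(name='ordering-example', description='Explicit step ordering')
def ordering_pipeline():
    op1 = component1()
    op2 = component2()
    # No data is passed; this only enforces that op2 runs after op1.
    op2.after(op1)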

Upvotes: 3
