Phil Chae

Reputation: 1136

Kubeflow pipeline unable to pick up metrics

I'm getting the logs below. In my training code, I save my accuracy to the path /accuracy.json and save a metrics file containing this accuracy to the path /mlpipeline-metrics.json. The JSON files are created correctly, but the Kubeflow pipeline (or Argo, which the logs below come from) seems unable to pick them up.

wait time="2020-09-03T04:07:19Z" level=info msg="Copying /mlpipeline-metrics.json from container base image layer to /argo/outputs/artifacts/mlpipeline-metrics.tgz"
wait time="2020-09-03T04:07:19Z" level=info msg="Archiving :/mlpipeline-metrics.json to /argo/outputs/artifacts/mlpipeline-metrics.tgz"
wait time="2020-09-03T04:07:19Z" level=info msg="sh -c docker cp -a :/mlpipeline-metrics.json - | gzip > /argo/outputs/artifacts/mlpipeline-metrics.tgz"
wait time="2020-09-03T04:07:19Z" level=warning msg="path /mlpipeline-metrics.json does not exist (or /mlpipeline-metrics.json is empty) in archive /argo/outputs/artifacts/mlpipeline-metrics.tgz"
wait time="2020-09-03T04:07:19Z" level=warning msg="Ignoring optional artifact 'mlpipeline-metrics' which does not exist in path '/mlpipeline-metrics.json': path /mlpipeline-metrics.json does not exist (or /mlpipeline-metrics.json is empty) in archive /argo/outputs/artifacts/mlpipeline-metrics.tgz"
wait time="2020-09-03T04:07:19Z" level=info msg="Staging artifact: transformer-pytorch-train-job-acc"
wait time="2020-09-03T04:07:19Z" level=info msg="Copying /accuracy.json from container base image layer to /argo/outputs/artifacts/transformer-pytorch-train-job-acc.tgz"
wait time="2020-09-03T04:07:19Z" level=info msg="Archiving :/accuracy.json to /argo/outputs/artifacts/transformer-pytorch-train-job-acc.tgz"
wait time="2020-09-03T04:07:19Z" level=info msg="sh -c docker cp -a :/accuracy.json - | gzip > /argo/outputs/artifacts/transformer-pytorch-train-job-acc.tgz"
wait time="2020-09-03T04:07:19Z" level=warning msg="path /accuracy.json does not exist (or /accuracy.json is empty) in archive /argo/outputs/artifacts/transformer-pytorch-train-job-acc.tgz"
wait time="2020-09-03T04:07:19Z" level=error msg="executor error: path /accuracy.json does not exist (or /accuracy.json is empty) in archive /argo/outputs/artifacts/transformer-pytorch-train-job-acc.tgz
github.com/argoproj/argo/errors.New
    /go/src/github.com/argoproj/argo/errors/errors.go:49
github.com/argoproj/argo/errors.Errorf
    /go/src/github.com/argoproj/argo/errors/errors.go:55
github.com/argoproj/argo/workflow/executor/docker.(*DockerExecutor).CopyFile
    /go/src/github.com/argoproj/argo/workflow/executor/docker/docker.go:66
github.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).stageArchiveFile
    /go/src/github.com/argoproj/argo/workflow/executor/executor.go:344
github.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).saveArtifact
    /go/src/github.com/argoproj/argo/workflow/executor/executor.go:245
github.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).SaveArtifacts
    /go/src/github.com/argoproj/argo/workflow/executor/executor.go:231
github.com/argoproj/argo/cmd/argoexec/commands.waitContainer
    /go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:54
github.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1
    /go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16
github.com/spf13/cobra.(*Command).execute
    /go/src/github.com/spf13/cobra/command.go:766
github.com/spf13/cobra.(*Command).ExecuteC
    /go/src/github.com/spf13/cobra/command.go:852
github.com/spf13/cobra.(*Command).Execute
    /go/src/github.com/spf13/cobra/command.go:800
main.main
    /go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17
runtime.main
    /usr/local/go/src/runtime/proc.go:201
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1333"

The pipeline code I'm using is below. If I understand correctly, the container saves the metrics and accuracy to the JSON file paths I specified, and Argo then picks these files up and renders the output in the Kubeflow UI. However, the logs above confuse me. Any ideas or suggestions would help me a lot.

import kfp
from kfp import dsl


@dsl.pipeline(
    name="PyTorch Job",
    description="Example Tutorial"
)
def containerop_basic():
    op = dsl.ContainerOp(
        name='pytorch-train-job',
        image='From our ECR',  # placeholder for the actual ECR image URI
        file_outputs={
            'acc': '/accuracy.json',
            'mlpipeline-metrics': '/mlpipeline-metrics.json'
        }
    )


if __name__ == '__main__':
    kfp.compiler.Compiler().compile(containerop_basic, __file__ + '.yaml')
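
For reference, here is roughly what my training code writes (a trimmed sketch; the metric name and accuracy value are illustrative). The metrics file follows KFP v1's mlpipeline-metrics schema so the UI can render it:

import json

# End of the training script; the accuracy value here is made up.
accuracy = 0.91

# Raw accuracy artifact, matching file_outputs['acc'].
with open('/accuracy.json', 'w') as f:
    json.dump({'accuracy': accuracy}, f)

# KFP v1 expects this schema in mlpipeline-metrics so the UI can render it.
metrics = {
    'metrics': [{
        'name': 'accuracy-score',   # name shown in the Kubeflow UI
        'numberValue': accuracy,    # must be numeric
        'format': 'PERCENTAGE',     # RAW or PERCENTAGE
    }]
}
with open('/mlpipeline-metrics.json', 'w') as f:
    json.dump(metrics, f)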

Upvotes: 1

Views: 1255

Answers (2)

Phil Chae

Reputation: 1136

I solved the problem. It was an authorization problem on the Argo side: when executing the pipeline, Argo needs a role that allows it to "watch" the pods. Adding that role to the service account it uses solved the problem.
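
For anyone hitting the same error, here is a sketch of the kind of grant that was missing, using the official kubernetes Python client. The namespace, role, and service-account names below are placeholders; adjust them to your installation:

from kubernetes import client, config

# Placeholder names for illustration; adjust to your install.
NAMESPACE = 'kubeflow'
SERVICE_ACCOUNT = 'pipeline-runner'

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

# Role letting Argo's wait sidecar watch pods in the namespace.
role = client.V1Role(
    metadata=client.V1ObjectMeta(name='argo-pod-watcher', namespace=NAMESPACE),
    rules=[client.V1PolicyRule(
        api_groups=[''],
        resources=['pods'],
        verbs=['get', 'list', 'watch'],
    )],
)
rbac.create_namespaced_role(namespace=NAMESPACE, body=role)

# Bind the role to the service account the workflow runs as.
# (V1Subject was renamed to RbacV1Subject in newer client releases.)
binding = client.V1RoleBinding(
    metadata=client.V1ObjectMeta(name='argo-pod-watcher-binding', namespace=NAMESPACE),
    subjects=[client.V1Subject(kind='ServiceAccount', name=SERVICE_ACCOUNT, namespace=NAMESPACE)],
    role_ref=client.V1RoleRef(api_group='rbac.authorization.k8s.io', kind='Role', name='argo-pod-watcher'),
)
rbac.create_namespaced_role_binding(namespace=NAMESPACE, body=binding)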

Upvotes: 1

O. Stern

Reputation: 1

When you specify the file_outputs={'kfp_reference_name': 'file_location'} dictionary, you're telling KFP that when the container's run ends, it should look for the file at file_location and copy it to a new location that other steps of the pipeline can access under kfp_reference_name (I won't get into it, but this is basically done using the Minio server deployed during Kubeflow's installation).
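
For illustration, a minimal sketch of that hand-off: once the artifact is staged, a downstream step can consume it through op.outputs (the second step and image tag here are made up):

import kfp
from kfp import dsl


@dsl.pipeline(name='PyTorch Job', description='Example Tutorial')
def containerop_basic():
    train = dsl.ContainerOp(
        name='pytorch-train-job',
        image='<training image>',  # placeholder
        file_outputs={'acc': '/accuracy.json'},
    )
    # KFP stages /accuracy.json (via Minio) and exposes it as
    # train.outputs['acc'], which downstream steps can consume.
    dsl.ContainerOp(
        name='print-acc',
        image='alpine:3.12',
        command=['echo', train.outputs['acc']],
    )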

From your logs, it seems that when KFP looks for the local file in your container, the file is not available at the specified location, so your problem is probably one of two:

  1. Your container saves the files to another location. For example, it might save them to the same folder as your code; say that's the src folder, in which case changing your code to the following would work:
file_outputs={
    'acc': '/src/accuracy.json',
    'mlpipeline-metrics': '/src/mlpipeline-metrics.json'
}
  2. Your container doesn't save the files at all, meaning you have a problem somewhere in your code or Dockerfile configuration.

In general, I also recommend going over Kubeflow's data passing tutorial; it's currently one of the best sources on the subject: https://github.com/kubeflow/pipelines/blob/master/samples/tutorials/Data%20passing%20in%20python%20components.ipynb
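
For example, in the lightweight-component style that tutorial covers, you let KFP choose the output paths instead of hard-coding them. A sketch, assuming a KFP v1 SDK, with a made-up accuracy value:

import json

from kfp.components import OutputPath, create_component_from_func


def train(mlpipeline_metrics_path: OutputPath('Metrics')):
    # KFP injects the output path; no hard-coded /mlpipeline-metrics.json.
    accuracy = 0.91  # made-up value for illustration
    metrics = {'metrics': [{'name': 'accuracy', 'numberValue': accuracy}]}
    with open(mlpipeline_metrics_path, 'w') as f:
        json.dump(metrics, f)


train_op = create_component_from_func(train, base_image='python:3.7')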

Upvotes: 0
