mendo

Reputation: 86

PartitionedDataSet not found when Kedro pipeline is run in Docker

I have multiple text files in an S3 bucket which I read and process. So I defined a PartitionedDataSet in the Kedro data catalog, which looks like this:

raw_data:
  type: PartitionedDataSet
  path: s3://reads/raw
  dataset: pandas.CSVDataSet
  load_args:
    sep: "\t"
    comment: "#"
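
For reference, a node that consumes this dataset receives a dictionary mapping each partition ID to a lazy load function, one entry per file found under the path. A minimal sketch of such a node (the name process_reads and the concatenation step are only illustrative):

from typing import Callable, Dict

import pandas as pd

def process_reads(partitions: Dict[str, Callable[[], pd.DataFrame]]) -> pd.DataFrame:
    # Each value is a load function; calling it reads one file from S3
    frames = [load() for load in partitions.values()]
    return pd.concat(frames, ignore_index=True)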

In addition, I implemented this solution to load all secrets from the credentials file via environment variables, including the AWS access keys.
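
In practice this means the standard variables that boto3/s3fs read from the environment, roughly (placeholder values):

AWS_ACCESS_KEY_ID=<key>
AWS_SECRET_ACCESS_KEY=<secret>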

When I run things locally using kedro run, everything works just fine. But when I build the Docker image (using kedro-docker) and run the pipeline in the Docker environment with kedro docker run, providing all environment variables via the --docker-args option, I get the following error:

Traceback (most recent call last):
  File "/usr/local/bin/kedro", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/kedro/framework/cli/cli.py", line 724, in main
    cli_collection()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/kedro/kedro_cli.py", line 230, in run
    pipeline_name=pipeline,
  File "/usr/local/lib/python3.7/site-packages/kedro/framework/context/context.py", line 767, in run
    raise exc
  File "/usr/local/lib/python3.7/site-packages/kedro/framework/context/context.py", line 759, in run
    run_result = runner.run(filtered_pipeline, catalog, run_id)
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", line 101, in run
    self._run(pipeline, catalog, run_id)
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/sequential_runner.py", line 90, in _run
    run_node(node, catalog, self._is_async, run_id)
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", line 213, in run_node
    node = _run_node_sequential(node, catalog, run_id)
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", line 221, in _run_node_sequential
    inputs = {name: catalog.load(name) for name in node.inputs}
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", line 221, in <dictcomp>
    inputs = {name: catalog.load(name) for name in node.inputs}
  File "/usr/local/lib/python3.7/site-packages/kedro/io/data_catalog.py", line 392, in load
    result = func()
  File "/usr/local/lib/python3.7/site-packages/kedro/io/core.py", line 213, in load
    return self._load()
  File "/usr/local/lib/python3.7/site-packages/kedro/io/partitioned_data_set.py", line 240, in _load
    raise DataSetError("No partitions found in `{}`".format(self._path))
kedro.io.core.DataSetError: No partitions found in `s3://reads/raw`

Note: The pipeline works just fine in the Docker environment if I move the files to a local directory, point the PartitionedDataSet at that path, rebuild the Docker image, and provide the environment variables through --docker-args.

Upvotes: 2

Views: 1086

Answers (1)

mendo

Reputation: 86

The solution (at least in my case) was to also provide the AWS_DEFAULT_REGION environment variable in the kedro docker run command.
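
For completeness, the working invocation looked roughly like this (placeholder values; --docker-args just forwards extra arguments to docker run):

kedro docker run --docker-args="-e AWS_ACCESS_KEY_ID=<key> -e AWS_SECRET_ACCESS_KEY=<secret> -e AWS_DEFAULT_REGION=<region>"

Without the region set, the S3 listing inside the container apparently came back empty, which PartitionedDataSet surfaces as the "No partitions found" error above.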

Upvotes: 2
