Ram D

Reputation: 167

Changing output dataset path in the transform function

Can we change the output dataset path dynamically inside my_compute_function, as shown below?

from transforms.api import transform, Input, Output


@transform(
  my_output=Output("/path/to/my/dataset"),
  my_input=Input("/path/to/input"),
)
def my_compute_function(my_output, my_input):
  my_output.path = "new path"  # the line in question: can the path be reassigned here?
  my_output.write_dataframe(
    my_input.dataframe()
  )

Upvotes: 0

Views: 823

Answers (1)

Jonathan Ringstad

Reputation: 967

No, this is not possible. The reason is that the inputs/outputs/transforms are fixed at "CI-time" or "build-time". When you press "commit" in Authoring or you merge a PR, a CI job is kicked off.

In this CI job, all the relations between inputs and outputs are determined. Output datasets that don't exist yet are created, and a "jobspec" is added to them. A "jobspec" is a snippet of JSON that describes to Foundry how a particular dataset is generated.

Anytime you press the "build" button on a dataset (or build the dataset through a schedule or similar), the jobspec is consulted. It contains a reference to the repository, revision, source file and entry point of the function that builds this dataset. From there the build is orchestrated and kicks off, invoking your function to produce the final output.
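To make the lookup concrete, here is a toy model in plain Python. It is only a sketch of the idea, not Foundry's implementation: the `JobSpec` class, the `JOBSPECS` registry, the two helper functions and the values "my-repo" and "abc123" are all hypothetical names made up for illustration, and the real jobspec JSON will look different. The four fields are the ones listed above.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical stand-in for a jobspec (the real JSON layout is not shown here)."""
    repository: str
    revision: str
    source_file: str
    entry_point: str


# Filled in at CI time, keyed by output dataset path.
JOBSPECS: Dict[str, JobSpec] = {}


def register_at_ci_time(output_path: str, spec: JobSpec) -> None:
    """CI step: record how the dataset at output_path is produced."""
    JOBSPECS[output_path] = spec


def build(output_path: str, entry_points: Dict[str, Callable[[], str]]) -> str:
    """Build step: look up the jobspec for the dataset, then run its function."""
    spec = JOBSPECS[output_path]  # KeyError if CI never registered this path
    return entry_points[spec.entry_point]()


register_at_ci_time(
    "/path/to/my/dataset",
    JobSpec("my-repo", "abc123", "my_transform.py", "my_compute_function"),
)
result = build(
    "/path/to/my/dataset",
    {"my_compute_function": lambda: "rows written to /path/to/my/dataset"},
)
```

In this model, a path that only appears while the function is running (such as "new path") has no entry in the registry, so nothing could ever trigger a build for it. That is the essence of why the output has to be known before the function runs.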

This mechanism allows you to get a "static view" of the entire pipeline, which you can then visualize with Monocle, as you might have seen.

[Image: notional Monocle example]

Depending on what your needs are, here are some solutions you might be able to use instead:

  • Tag the rows you're producing in your transform in some way, so that even though you put them into a single dataset, you can later select them by this tag/category.
  • If your set of categories does not change often, you can instead create the output datasets ahead of time and then filter each row into the dataset it belongs in.

The main drawback of the latter approach is that it's not very dynamic: if a new category shows up, you'll have to change the code manually to "triage" it into a new dataset, and its data won't be available until you do.
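The triage logic can be sketched in plain Python, independent of Foundry. This is a minimal illustration under assumptions: the `category` column, the output names `out_a` and `out_b`, the `ROUTES` mapping and the `triage` helper are all hypothetical, and the catch-all bucket `out_uncategorized` is an extra I added so that rows with an unknown category are not silently dropped.

```python
from typing import Dict, Iterable, List

Row = Dict[str, str]

# Fixed when the code is written: one entry per output dataset created ahead of time.
ROUTES: Dict[str, str] = {
    "a": "out_a",
    "b": "out_b",
}
# Assumption: a catch-all output for categories that have no dataset yet.
FALLBACK = "out_uncategorized"


def triage(rows: Iterable[Row]) -> Dict[str, List[Row]]:
    """Group rows by the output they belong to, based on their 'category' tag."""
    outputs: Dict[str, List[Row]] = {name: [] for name in ROUTES.values()}
    outputs[FALLBACK] = []
    for row in rows:
        target = ROUTES.get(row.get("category", ""), FALLBACK)
        outputs[target].append(row)
    return outputs


rows = [
    {"id": "1", "category": "a"},
    {"id": "2", "category": "b"},
    {"id": "3", "category": "c"},  # new category with no dataset yet
]
routed = triage(rows)
```

In an actual transform, each output name would correspond to an `Output(...)` declared in the `@transform` decorator, and the grouping would be done by filtering the input DataFrame on the tag column rather than by looping over rows. The row with category "c" shows the drawback: it stays in the catch-all until someone adds a new route and a new output in code.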

There are other solutions (ultimately, it is possible to make API calls and manually adjust inputs/outputs as well, for instance), but they are more complex and undesirable from a maintenance perspective.

Upvotes: 0
