Gavisha BN

Reputation: 141

How to generate dynamic output files from a config file in Palantir Foundry

I have a config file with two columns, col1 and col2.


Now I need to import this config file into my main Python transform and extract the column values, so that I can build a dynamic output path for each combination of values by iterating over all the rows.

For example, output_path1 = Constant + value1 + value2

output_path2 = Constant + value3 + value4

Please suggest a solution for generating the output files in Palantir Foundry (Code Repositories).
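To illustrate the path construction I'm after (the constant and the row values below are made-up placeholders):

```python
# Illustrative sketch only; CONSTANT and the row values are placeholders.
CONSTANT = "/Output/Path/"
config_rows = [("value1", "value2"), ("value3", "value4")]

# One output path per config row: Constant + col1 + col2
output_paths = [CONSTANT + col1 + col2 for col1, col2 in config_rows]
# output_paths == ["/Output/Path/value1value2", "/Output/Path/value3value4"]
```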

Upvotes: 0

Views: 1588

Answers (2)

fmsf

Reputation: 37137

You cannot programmatically create transforms based on another dataset's contents. The datasets are created at CI time.

You can, however, have a constants file inside your code repo, which is read at CI time, and use that to generate transforms. For example:

myconfig.py:

dataset_pairs = [
  {
    "in": "/path/to/input/dataset",
    "out": "/path/to/output/dataset",
  },
  {
    "in": "/path/to/input/dataset2",
    "out": "/path/to/output/dataset2",
  },
  # ...
  {
    "in": "/path/to/input/datasetN",
    "out": "/path/to/output/datasetN",
  },
]

///////////////////////////
anotherfile.py:

from transforms.api import transform_df, Input, Output

from myconfig import dataset_pairs

TRANSFORMS = []
for conf in dataset_pairs:
  @transform_df(Output(conf["out"]), my_input=Input(conf["in"]))
  def my_generated_transform(my_input):
    df = my_input  # ... your transformation logic here ...
    return df

  TRANSFORMS.append(my_generated_transform)

To reiterate: you cannot create myconfig.py programmatically based on a dataset's contents, because this code runs at CI time, when it doesn't have access to the datasets.

Upvotes: 0

Jonathan Ringstad

Reputation: 967

What you probably want to use is a transform generator. In the "Python Transforms" chapter of the documentation, there's a section "Transform generation" which outlines the basics of this.

The most straightforward path is likely to generate multiple transforms, but if you want just one transform that outputs to multiple datasets, that would be possible too (if a little more complicated.)

For the former approach, you would add a .yaml file (or similar) to your repo, in which you define your values, and then you read the .yaml file and generate multiple transforms based on the values. The documentation gives an example that does pretty much exactly this.
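As a rough sketch of that former approach (using JSON from the standard library in place of YAML purely so the snippet is self-contained; all paths are hypothetical), the generation loop could look like:

```python
import json

# Config file checked into the repo. The docs use YAML; JSON is used
# here only so this sketch runs with the standard library alone.
CONFIG = """
[
  {"in": "/path/to/input/dataset",  "out": "/path/to/output/dataset"},
  {"in": "/path/to/input/dataset2", "out": "/path/to/output/dataset2"}
]
"""

def generate_transforms(pairs):
    """Build one transform description per config entry.

    In a real repo each entry would become an actual transform, e.g.:
        @transform_df(Output(conf["out"]), my_input=Input(conf["in"]))
        def compute(my_input): ...
    Here we only record the input/output wiring to keep the sketch
    runnable outside Foundry.
    """
    return [{"input": conf["in"], "output": conf["out"]} for conf in pairs]

TRANSFORMS = generate_transforms(json.loads(CONFIG))
```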

For the latter approach, you would probably want to read the .yaml file in your pipeline definer, and then dynamically add outputs to a single transform. In your transforms code, you then need to be able to handle an arbitrary number of outputs in some way (which I presume you have a plan for.) I suspect you might need to fall back to manual transform registration for this, or you might need to construct a transforms object without using the decorator. If this is the solution you need, I can construct an example for you.
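For the single-transform variant, here is a hedged sketch of how the dynamic set of outputs might be assembled. The decorator usage is shown only in a comment (the exact registration mechanics depend on your transforms version), and all paths and names are hypothetical:

```python
# Hypothetical output paths, as they might be read from the .yaml file
# at CI time.
output_paths = ["/path/to/out/a", "/path/to/out/b", "/path/to/out/c"]

# Build keyword arguments: one named output per path.
output_kwargs = {f"out_{i}": path for i, path in enumerate(output_paths)}

# In a real repo these would be wrapped in Output(...) and splatted into
# the decorator, roughly (untested, shape only):
#   @transform(my_input=Input("/path/to/in"),
#              **{name: Output(path) for name, path in output_kwargs.items()})
#   def compute(my_input, **outputs):
#       for out in outputs.values():
#           out.write_dataframe(my_input.dataframe())
```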

Before you proceed with this though, I want to note that the number of inputs and outputs is fixed at "CI-time" or "compile-time". When you press the "commit" button in Authoring (or you merge a PR), it is at this point that the code is run that generates the transforms/outputs. At a later time, when you build the actual dataset (i.e. you run the transforms) it is not possible to add/remove inputs, outputs and transforms anymore.

So to change the number of inputs/outputs/transforms, you will need to go to the repo, modify the .yaml file (or whatever you chose to use) and then press the commit button. This will cause the CI checks to run, and publish the new code, including any new transforms that might have been generated in the process.

If this doesn't work for you (i.e. you want to decide at dataset build-time which outputs to generate) you'll have to fundamentally re-think your approach. Otherwise you should be good with one of the two solutions I roughly outlined above.

Upvotes: 1
