Digi

Reputation: 85

How to read from and write to the same file in Palantir Foundry?

I have a very simple task of updating the contents of a control file in Palantir Foundry. I need to read the contents of the file, perform a check, and then write back to the same file. However, if I provide the same file as both Input and Output in the transform, I get the following error:

Cyclic dependency exists in the code for the following Foundry datasets:...

Is there a workaround for this?

Updated question with more details and code snippet as per suggestion below

I have a requirement to append the contents of a transaction file to an incremental snapshot file, and this process is supposed to run once daily. However, if the process runs more than once in a day (inadvertently, or when restarting after a failure, etc.), I need to ensure that the same records are not appended again. So I am trying to use the following piece of code (as advised in the documentation).

from datetime import datetime

from pyspark.sql.functions import lit
from transforms.api import transform, incremental, Input, Output


@incremental()
@transform(
    inp=Input("<path>/daily_trans_file"),
    op=Output("<path>/hist_snapshot_file")
)
def my_compute_function(inp, op):
    # Calculate today's date and add it to the daily transactional data
    proc_date = datetime.today().strftime("%Y-%m-%d")
    inp_df = inp.dataframe()
    inp_df = inp_df.withColumn("Processed_Date", lit(proc_date))

    # Append the above dataframe to the output if it is not already present
    op.write_dataframe(
        inp_df.subtract(
            op.dataframe('previous', schema=inp_df.schema)))

However, there seems to be some problem with this. The output just retains the data that was originally there and does not append any new data. I am not sure why the subtract is not working as it should. I checked the documentation example, and there is no need to set any other mode in this case.

Upvotes: 3

Views: 1850

Answers (1)

ollie299792458

Reputation: 256

It is possible to write to and read from the same dataset in Foundry; however, it is not possible for two transforms to write to the same dataset (which seems to be what you require). Reading and writing the same dataset is rarely necessary, though: it is almost always best to keep a linear data flow.

By making use of incremental transforms (transform, not transform_df), it is possible to read the previous version of the transform's output. Use output read mode previous and write mode replace (set the write mode after you read the dataframe), so that you first read the output and then write it. Note that you'll need to have something in the output on the first run, so when the transform runs non-incrementally you'll need to populate the output from some other input (you might want to do this on incremental runs as well, to ensure data freshness). For more details, see Code Repositories - Incremental Python Transforms in the documentation.
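The read-previous/write-replace pattern can be illustrated outside Foundry with a small pure-Python stand-in. The function and variable names below are illustrative only, not part of the Foundry API; in a real transform the same steps are `op.dataframe('previous')`, then `op.set_mode('replace')`, then `op.write_dataframe(...)`:

```python
# Sketch of the incremental pattern described above: read the previous
# output, drop rows that are already present, and write back the union.

def incremental_append(previous_rows, input_rows):
    """Return previous_rows plus any input rows not already present."""
    seen = set(previous_rows)
    fresh = [row for row in input_rows if row not in seen]
    return previous_rows + fresh

# First run appends both rows; re-running with the same input changes nothing,
# which is exactly the idempotency the question is after.
out = incremental_append([], [("a", "2023-01-01"), ("b", "2023-01-01")])
out = incremental_append(out, [("a", "2023-01-01"), ("b", "2023-01-01")])
```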

However, for the workflow you're suggesting (as I understand it), you're probably better off using two datasets, an input and a verified dataset, with downstream consumers reading the verified one.

Updated answer to specific issue

Your approach is generally correct; however, I would separate the data logic from the control logic.

What I mean by this is: first get the latest date from the output and compare it to today's date. Then, if they are different, write the input data; otherwise, write an empty dataframe (sqlContext.createDataFrame(sc.emptyRDD(), schema)).

This way it is much easier to debug, since you can check the control and data parts separately. It is also computationally cheaper than doing a subtract on every run.
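The separation of control logic from data logic can be sketched in plain Python. The names here are illustrative; in the real transform the comparison would run against the latest Processed_Date in the output, and the empty-list branch corresponds to writing an empty dataframe:

```python
from datetime import date

def rows_to_append(input_rows, latest_processed_date, today):
    # Control logic: if today's batch was already processed, append nothing.
    today_str = today.strftime("%Y-%m-%d")
    if latest_processed_date == today_str:
        return []  # write an empty dataframe in the real transform
    # Data logic: stamp each row with the processing date and append it all.
    return [row + (today_str,) for row in input_rows]

# A rerun on the same day appends nothing; the two paths can be tested
# independently, which is the debugging advantage mentioned above.
fresh_run = rows_to_append([("a",)], "2023-01-01", date(2023, 1, 2))
rerun = rows_to_append([("a",)], "2023-01-02", date(2023, 1, 2))
```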

But your specific issue, I believe, is that you don't have inp added as a snapshot input, so your transform sometimes runs in non-incremental mode and sometimes runs with an empty input. To fix this, add inp as a snapshot input using the snapshot_inputs parameter, i.e. @incremental(snapshot_inputs=["inp"]).

Upvotes: 1
