Andrew Andrade

Reputation: 2808

How to keep a unique set of keys with an incremental transformation in Palantir Foundry?

I am trying to save compute on a python transform in Foundry.

I want to run my code incrementally while keeping a unique set of keys, without having to do a full snapshot read of the whole dataset and then de-duplicate it.

If I try something like df_out = df.select("composite_key").dropDuplicates(), I am afraid it reads the full input dataset; I want to make use of the deduplication I have already done on previous runs.

Upvotes: 1

Views: 400

Answers (2)

Kellen Donohue

Reputation: 787

If there are other columns in the new data but you still want to de-dupe by key, you can use this approach.

    # If there may be duplicates within the new batch itself, do this step first.
    # df = df.dropDuplicates(['composite_key'])

    df_prev = df_out.dataframe(mode='previous', schema=df.schema)
    # This keeps the new row for any existing key.
    # You could do the opposite by swapping the places of the tables.
    existing = df_prev.join(df, on='composite_key', how='leftanti')
    result = existing.unionByName(df)
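The merge logic above can be sketched in plain Python (the real code runs on Spark DataFrames inside a Foundry incremental transform; the data and the merge_keep_new helper below are hypothetical, for illustration only):

    # Sketch of: previous.join(new, on=key, how='leftanti').unionByName(new)
    def merge_keep_new(previous_rows, new_rows, key):
        """Rows from new_rows win for any key present in both batches."""
        new_keys = {row[key] for row in new_rows}
        # left-anti join: keep previous rows whose key is NOT in the new batch
        survivors = [row for row in previous_rows if row[key] not in new_keys]
        # union: surviving previous rows plus all new rows
        return survivors + new_rows

    previous = [{"composite_key": "a", "value": 1},
                {"composite_key": "b", "value": 2}]
    new = [{"composite_key": "b", "value": 99},
           {"composite_key": "c", "value": 3}]
    merged = merge_keep_new(previous, new, "composite_key")
    # key "b" now carries the new value 99; "a" and "c" each appear once

Swapping the two arguments would instead keep the previously stored row whenever a key reappears.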

Upvotes: 0

Andrew Andrade

Reputation: 2808

The trick here is to use the previous version of the output dataset:

    df_out = df.select("composite_key").unionByName(
        df_out.dataframe('previous', schema=df.schema).select("composite_key")
    ).dropDuplicates()

Using this pattern, you don't need to do a lookup on the full dataset: you take the previously computed unique set of keys, union it with the new data, and then de-dupe.
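The same pattern can be sketched batch-by-batch in plain Python (hypothetical batches; in Foundry the previous set comes from df_out.dataframe('previous') rather than a variable carried between runs):

    # Sketch of: unionByName(previous_keys).dropDuplicates(), run incrementally
    def incremental_unique_keys(previous_keys, new_batch):
        """Union the previously de-duplicated keys with the new batch's
        keys, then de-dupe; only the new batch's raw rows are read."""
        return sorted(set(previous_keys) | set(new_batch))

    keys = []                                             # first run starts empty
    keys = incremental_unique_keys(keys, ["a", "b", "b"])  # batch 1
    keys = incremental_unique_keys(keys, ["b", "c"])       # batch 2
    # batch 2 never re-reads batch 1's raw rows, only its de-duped keys

Each run therefore touches only the new rows plus the (already small) previous key set, which is where the compute savings come from.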

Upvotes: 0
