Reputation: 2808
I am trying to save compute on a Python transform in Foundry.
I want to run my code incrementally while keeping a unique set of keys, without having to do a snapshot read of the full dataset and then de-duplicate it.
If I try something like df_out = df.select("composite_key").dropDuplicates()
I am afraid this reads the full input dataset; I want to make use of the deduplication I have already done.
Upvotes: 1
Views: 400
Reputation: 787
If there are other columns in the new data but you still want to de-dupe by key, you can use this approach:
# If there may be duplicates within the new data itself, de-dupe it first:
# df = df.dropDuplicates(['composite_key'])
df_prev = df_out.dataframe(mode='previous', schema=df.schema)
# Keep the new row for any existing key.
# You could do the opposite by swapping the two tables in the join.
existing = df_prev.join(df, on='composite_key', how='leftanti')
result = existing.unionByName(df)
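For context, this only works inside an incremental transform, where the output object exposes dataframe('previous', ...). Below is a minimal sketch of the surrounding transform; the dataset paths are hypothetical, and the set_mode('replace') call is an assumption on my part, needed because the result rewrites the previous output rather than appending to it.
from transforms.api import transform, incremental, Input, Output

@incremental()
@transform(
    df_out=Output("/examples/deduped_output"),  # hypothetical path
    df_in=Input("/examples/raw_input"),         # hypothetical path
)
def compute(df_in, df_out):
    # Under @incremental, reading the input returns only the rows
    # added since the last build, not the full dataset.
    df = df_in.dataframe().dropDuplicates(['composite_key'])

    # Previous version of the output; the schema argument lets the
    # first (snapshot) run return an empty dataframe instead of failing.
    df_prev = df_out.dataframe(mode='previous', schema=df.schema)

    # Keep the new row for any key that already exists.
    existing = df_prev.join(df, on='composite_key', how='leftanti')

    # The result replaces the previous output wholesale, so switch the
    # write mode from the incremental default 'modify' (append) to 'replace'.
    df_out.set_mode('replace')
    df_out.write_dataframe(existing.unionByName(df))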
Upvotes: 0
Reputation: 2808
The trick here is to use the previous version of the output dataset:
# Select the key column first so the schemas line up for the union.
keys = df.select("composite_key")
result = keys.unionByName(
    df_out.dataframe('previous', schema=keys.schema)
).dropDuplicates()
Using this pattern you don't need to do a lookup on the full dataset: you take the previously computed unique set of keys, union it with the new data, and de-dupe the result.
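One hedged note on the write step: since each build recomputes the full key set, the output presumably has to be written in replace mode rather than with the incremental default of appending (the transform skeleton sketched in the answer above applies here too).
# Inside the @incremental() transform body, after computing `result`:
df_out.set_mode('replace')      # overwrite the previous key set instead of appending
df_out.write_dataframe(result)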
Upvotes: 0