AndreyFaktor

Reputation: 1

Partial output update/dynamic partition overwrite in Palantir Foundry

I would really appreciate it if someone could offer some advice. My actual case is a bit more complex, but let's consider a simpler example.

Let's say we have a big dataset that needs to be updated incrementally. Updates arrive several times a day, and each time we receive all the data we have for today (a small amount at first, growing towards the end of the day). So each batch contains new data plus the data from the previous batch, possibly with some changes or deletions relative to that previous batch.

Logically, to make this efficient we need to overwrite only the current day's data each time a new batch arrives, leaving the data for all previous days untouched. If we were using plain Spark, we could achieve this by reading only the new files, partitioning the output dataset by date, and setting the Spark configuration spark.sql.sources.partitionOverwriteMode = dynamic. That would suit our case perfectly, overwriting today's data each time new data arrives.
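For reference, the plain-Spark setup described above looks roughly like this. This is a sketch: the path, the DataFrame name todays_df, and the date column are illustrative, and it assumes an existing SparkSession named spark.

```python
# Only partitions present in todays_df are rewritten; all other
# date partitions of the output are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(todays_df
    .write
    .mode("overwrite")      # "overwrite" + dynamic mode = per-partition overwrite
    .partitionBy("date")    # partition column matching the daily batches
    .parquet("/data/output"))
```

With the default partitionOverwriteMode = static, the same write would instead delete every existing partition first, which is exactly the behavior the question is trying to avoid.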

But as I understand it, I can't use this setting in Foundry; I tried it, and it seems to be simply ignored.

Using the incremental decorator, on the other hand, lets us read only the new data, but the output write mode can only be set to modify (append-only) or replace (complete overwrite). So there is no option for partial overwrites.
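For context, a minimal incremental transform looks roughly like the sketch below. The dataset paths and function names are illustrative, and it assumes Foundry's transforms.api; the exact decorator options should be checked against the Foundry documentation.

```python
from transforms.api import transform, incremental, Input, Output

@incremental()
@transform(
    out=Output("/project/datasets/output"),    # illustrative path
    source=Input("/project/datasets/input"),   # illustrative path
)
def compute(source, out):
    # In incremental mode, dataframe() returns only rows not yet processed.
    new_rows = source.dataframe()
    # "modify" appends to the previous output; "replace" rewrites it entirely.
    # There is no mode that overwrites only a subset of the output.
    out.set_mode("modify")
    out.write_dataframe(new_rows)
```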

Does anyone know how such scenarios (where append is not enough and a total replacement of the output is too costly and inefficient) can be handled in Foundry?

Thanks!

Upvotes: 0

Views: 286

Answers (2)

user5233494

Reputation: 245

You can split your dataset into one that builds as a snapshot (to apply updates) and a historical dataset that builds incrementally, containing older data that will not receive updates.

Upvotes: 0

ZettaP

Reputation: 1399

Indeed, the Spark overwrite doesn't work out of the box: in incremental mode, all your changes are written as a new transaction on top of your existing transactions. You would need to update existing files (e.g. write an empty file with the same name to delete a file, or overwrite a file with the same name to update existing records).

There are a few ways to address this use case (the list below is non-exhaustive). It largely depends on whether your use case requires a "last version only" view of your data, and where.

If you don't need a "last version only" but just to process the changes

  • Output an incremental change log. Your output dataset can have rows added that represent the changes (this row was deleted, this row was edited, this row was added), so that your downstream pipeline can rely on the incremental mechanism to process the updates.
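The change-log idea can be illustrated with a small pure-Python sketch that diffs two consecutive batches of the same day. The function name, the key column, and the op column are all illustrative; in Foundry you would derive the same columns with Spark joins instead.

```python
def change_log(prev_batch, new_batch, key="id"):
    """Tag rows with an 'op' column describing the change between two batches.

    Each batch is a list of dicts; rows are matched on `key`.
    Returns rows marked 'added', 'edited', or 'deleted'.
    """
    prev = {row[key]: row for row in prev_batch}
    new = {row[key]: row for row in new_batch}
    log = []
    for k, row in new.items():
        if k not in prev:
            log.append({**row, "op": "added"})
        elif row != prev[k]:
            log.append({**row, "op": "edited"})
    for k, row in prev.items():
        if k not in new:
            log.append({**row, "op": "deleted"})
    return log
```

Appending these tagged rows to the output keeps the transform append-only (mode "modify"), while still letting downstream consumers reconstruct the latest state.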

If you need the "last version only" but it can live in an Object

If you need the "latest version only" as a dataset

  • You can create a View on top of the change-log dataset above (in a folder, right click > New > View) and use its deduplication feature, so that only the latest version of the data is exposed. Views also support a deletion column, to drop rows marked with deleted=True.

Upvotes: 1
