Reputation: 1
I would really appreciate it if someone could help with some advice. My case is a bit more complex, but let's consider a simpler example.
Let's say we've got a big dataset that needs to be updated incrementally. Updates come several times a day, and each time we receive all the data we have for today (a small amount at first, then more and more towards the end of the day). So each batch contains new data plus the data from the previous batch, which may include changes or deletions compared to that previous batch.
Logically, to make this efficient, we need to overwrite only the current day's data each time we receive a new batch, leaving the data for all previous days untouched. If we were using plain Spark, we could achieve this by reading only the new files, partitioning our output dataset by date, and setting the Spark configuration spark.sql.sources.partitionOverwriteMode = dynamic. That would suit our case perfectly, overwriting today's data each time new data comes in.
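To make the desired behavior concrete, here is a minimal plain-Python model of what dynamic partition overwrite does: only the partitions that appear in the incoming batch are replaced, and every other partition is left alone. This is not Spark or Foundry code; the function name, the dict-of-lists table layout, and the sample rows are all assumptions made up for illustration.

```python
from collections import defaultdict


def overwrite_partitions_dynamic(table, batch, key="date"):
    """Return a new table in which only partitions present in `batch` are replaced.

    `table` is a hypothetical stand-in for a partitioned dataset:
    a dict mapping partition value -> list of row dicts.
    """
    # Group the incoming rows by their partition value.
    incoming = defaultdict(list)
    for row in batch:
        incoming[row[key]].append(row)

    # Copy the existing partitions, then replace only the ones we received.
    result = {part: list(rows) for part, rows in table.items()}
    for part, rows in incoming.items():
        result[part] = rows
    return result


table = {
    "2024-01-01": [{"date": "2024-01-01", "id": 1, "value": "a"}],
    "2024-01-02": [
        {"date": "2024-01-02", "id": 2, "value": "b"},
        {"date": "2024-01-02", "id": 3, "value": "c"},
    ],
}
# Today's batch: id 2 changed, id 3 was deleted, id 4 is new.
batch = [
    {"date": "2024-01-02", "id": 2, "value": "b2"},
    {"date": "2024-01-02", "id": 4, "value": "d"},
]
updated = overwrite_partitions_dynamic(table, batch)
```

Note that the deletion of id 3 falls out for free: the whole partition for today is swapped, so rows missing from the latest batch simply disappear.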
But as far as I understand, I can't use such settings in Foundry: I tried it, and the setting seems to be simply ignored.
Using the incremental decorator, on the other hand, lets us read only new data, but the output write mode can only be set to modify (append only) or replace (complete overwrite). So there is no option for partial overwrites.
Does anyone know how such scenarios (where append is not enough and total replacement of the output is too costly and inefficient) can be handled in Foundry?
Thanks!
Upvotes: 0
Views: 286
Reputation: 245
You can split your dataset into one that builds as a snapshot (to apply updates) and a historical dataset that builds incrementally and contains older data that will no longer receive updates.
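One way to picture this split is the plain-Python sketch below. It is only a model of the idea, not Foundry code: the names `ingest_batch` and `full_view`, the tuple used for the current day, and the rule that the previous day is frozen into the historical part when a batch for a new day arrives are all assumptions added for illustration.

```python
def ingest_batch(historical, current, batch_day, batch_rows):
    """Apply one batch to the (historical, current) pair and return the new pair.

    historical: list of rows, only ever appended to (the incremental part).
    current:    (day, rows) for the day still receiving updates, or None.
    """
    if current is not None and current[0] != batch_day:
        # The day rolled over: the previous day is final, so append it once.
        historical = historical + current[1]
    # The current day is always fully replaced (the snapshot part).
    current = (batch_day, list(batch_rows))
    return historical, current


def full_view(historical, current):
    """What a downstream reader sees: historical rows plus the current day."""
    return historical + (current[1] if current is not None else [])


hist, cur = [], None
hist, cur = ingest_batch(hist, cur, "2024-01-01", [{"id": 1}])
hist, cur = ingest_batch(hist, cur, "2024-01-01", [{"id": 1}, {"id": 2}])
hist, cur = ingest_batch(hist, cur, "2024-01-02", [{"id": 3}])
view_ids = [row["id"] for row in full_view(hist, cur)]
```

The expensive historical part is touched only once per day, while the small current-day part is cheap to rebuild on every batch.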
Upvotes: 0
Reputation: 1399
Indeed, the Spark overwrite doesn't work out of the box, because in incremental mode all your changes are written to a new transaction on top of your existing transactions. To change existing data you would need to update the existing files themselves (e.g. write an empty file with the same name to delete a file, or write a file with the same name to update existing records).
There are a few ways to address this use case (the list below is non-exhaustive). It largely depends on whether your use case requires a "last version only" of your data, and where.
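The file-level behavior described above can be modeled in a few lines of plain Python. This is a toy illustration of stacked transactions, not the Foundry API: the function names, the file names, and the representation of a transaction as a dict of file name to rows are all hypothetical.

```python
def resolve_files(transactions):
    """Collapse a stack of transactions into the files a reader would see.

    Each transaction is a dict mapping file name -> list of rows.
    A later file with the same name replaces the earlier one.
    """
    files = {}
    for txn in transactions:  # oldest first
        files.update(txn)
    return files


def read_rows(transactions):
    """Read all rows from the resolved view; empty files contribute nothing."""
    files = resolve_files(transactions)
    rows = []
    for name in sorted(files):
        rows.extend(files[name])
    return rows


transactions = [
    # Initial load: one file per day.
    {"day-2024-01-01": ["a", "b"], "day-2024-01-02": ["c"]},
    # Update: same file name, new content replaces the old records.
    {"day-2024-01-02": ["c2", "d"]},
    # Delete: an empty file with the same name hides the old rows.
    {"day-2024-01-01": []},
]
rows = read_rows(transactions)
```

In this model a reader sees only the rows "c2" and "d": the first day's file was emptied and the second day's file was replaced.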
If you don't need a "last version only" but just to process the changes
If you need the "last version only" but it can live in an Object
If you need the "latest version only" as a dataset
Upvotes: 1