Reputation: 63
I want to be able to do two things:
1. Store a hash of a dataset's contents (so I can decide whether it has updated). To date, I have done this via a second output dataset with a single row that stores the hash and row count. In my Transform I can read that output and compare it to the current build's hash and row count to decide if the data has updated. This works fine, but I'd like to avoid having a second dataset if possible.
2. Pass through timestamps from upstream dependencies so that in downstream workflows I can answer "when did dependency X last update?"
It seems like both of these could be solved by some sort of key-value metadata store on the dataset.
Upvotes: 0
Views: 328
Reputation: 1747
You're correct that one of the most straightforward ways to do this is to decorate the rows with a timestamp value. In fact, with Foundry's Parquet storage system, this will be encoded using Dictionary Encoding, a highly efficient mechanism for storing repeated values.
The problem with this approach is you'll have to stack a new column for each phase of updating you want to keep track of. This might prove annoying to maintain in practice.
However, if you don't want to add this data to your rows and instead simply want to store your metadata, you have two options, one of which you've already found:
1. Add a second output dataset that keeps track of this information.
2. Write an extra file (.csv or .txt) to your output keeping track of this information.
Foundry won't consider your extra .csv or .txt file on the output if you're writing a standard DataFrame to it, since your schema by default will only read Parquet files. This means you can store this little snippet of information without affecting your output. If you check the platform documentation, you can confirm that it's possible to write both a DataFrame and a file of your own to an output.
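As a rough sketch of the extra-file idea, the metadata can be kept as a small key-value file next to the data. This is not Foundry's API: `write_metadata`, `read_metadata`, `local_opener` and the `metadata.json` file name are hypothetical names of mine, and a local directory stands in for the output's file system. I'm assuming that in a real transform you would pass in whatever `open(path, mode)`-style callable the output's file system exposes; check the platform documentation for the exact call. JSON is used here rather than .csv or .txt only because it maps directly onto key-value pairs.

```python
import json
import os
import tempfile

METADATA_FILE = "metadata.json"  # hypothetical file name


def write_metadata(open_file, metadata):
    """Write a small key-value dict as JSON using an open(path, mode)-style callable."""
    with open_file(METADATA_FILE, "w") as f:
        json.dump(metadata, f, sort_keys=True)


def read_metadata(open_file):
    """Read the metadata back; return an empty dict if the file does not exist yet."""
    try:
        with open_file(METADATA_FILE, "r") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def local_opener(root):
    """Stand-in for an output file system (assumption): opens files under a local directory."""
    def open_file(path, mode):
        return open(os.path.join(root, path), mode)
    return open_file


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        opener = local_opener(root)
        # All values below are made up for illustration.
        write_metadata(opener, {
            "hash": "abc123",
            "row_count": 42,
            "upstream_updated_at": {"dependency_x": "2024-01-01T00:00:00Z"},
        })
        print(read_metadata(opener))
```

The same file can carry the upstream timestamps from the question, since each dependency is just another key.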
It may be simpler to interact with a second output, however, since the mechanisms of Incremental Transforms and schema handling will be taken care of for you, so I'd recommend proceeding with option 1, as you are right now.
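Wherever the hash ends up being stored, the comparison step described in the question could look roughly like the sketch below. It is plain Python over rows represented as dicts rather than a Spark DataFrame, and `content_fingerprint` and `has_updated` are hypothetical helper names, not part of any Foundry library. Sorting the per-row digests makes the result independent of row order, which is a design choice of this sketch and not something the question specifies.

```python
import hashlib
import json


def content_fingerprint(rows):
    """Return an order-independent hash and row count for an iterable of dict rows."""
    row_digests = sorted(
        hashlib.sha256(
            json.dumps(row, sort_keys=True, default=str).encode("utf-8")
        ).hexdigest()
        for row in rows
    )
    # Hash the sorted per-row digests so that row order does not affect the result.
    combined = hashlib.sha256("".join(row_digests).encode("utf-8")).hexdigest()
    return {"hash": combined, "row_count": len(row_digests)}


def has_updated(previous, current):
    """True when the stored fingerprint is missing or differs from the current one."""
    return previous != current
```

On the first build there is no stored fingerprint, so passing an empty dict as `previous` reports the data as updated.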
Upvotes: 2