hjones

Reputation: 158

In Palantir Foundry, why should I not build downstream of Changelog datasets?

Foundry has a concept of changelog datasets, which I am using in order to speed up my ontology syncs. However, I've been told to always build datasets from the 'snapshot' version of the dataset, rather than the 'changelog' version. Why is this?

Upvotes: 1

Views: 400

Answers (1)

hjones

Reputation: 158

In summary: Changelog datasets, by design, contain previous versions of the same row. Unless your transform is written to handle this, it will behave as if it had been given outdated or duplicated input data.


Each time a Changelog dataset is built, any changes to the input data are appended to it as new rows. This lets Foundry's Object Storage apply just the diff against the currently synced data, minimising the amount of data that needs to be synced.

This means the changelog dataset is designed to contain multiple entries for each single row in the input dataset: every time an input row changes, another entry holding the new version of that row is appended to the changelog.
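To make this concrete, here is a minimal PySpark sketch of the shape this takes. The schema (id, value, updated_at) is purely illustrative; the real changelog format differs:

    # Illustrative only: a hypothetical changelog where one logical row
    # (id=1) changed once, so it appears twice in the physical data.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    changelog = spark.createDataFrame(
        [
            (1, "draft",     "2024-01-01"),  # first version of row 1
            (2, "submitted", "2024-01-01"),  # first version of row 2
            (1, "approved",  "2024-01-02"),  # row 1 changed: appended again
        ],
        ["id", "value", "updated_at"],
    )

    changelog.count()                          # 3 physical rows...
    changelog.select("id").distinct().count()  # ...but only 2 logical rows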

Unless your transform is expecting this:

  • you will, in effect, end up processing multiple versions of the same row, which can look like outdated or duplicated input data.
  • unless your transform runs incrementally and only passes the newly appended rows through in an APPEND transaction, your output won't preserve the append-only behaviour that makes changelog datasets efficient.
    • In particular, a SNAPSHOT output transaction can contain multiple entries for a single row. If you then synced that data to Object Storage, you could end up seeing multiple or outdated versions of rows in your ontology.

As a result, unless your transform is designed to handle the format of changelog datasets, it's best to build off the 'snapshot' version of datasets.
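If a transform genuinely has to read a changelog dataset, one way to handle the format is to first collapse it back to the latest version of each row, so the rest of the transform sees snapshot-like input. A sketch, continuing the hypothetical schema above (updated_at stands in for whatever ordering information the real format provides):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    changelog = spark.createDataFrame(
        [(1, "draft", "2024-01-01"),
         (2, "submitted", "2024-01-01"),
         (1, "approved", "2024-01-02")],
        ["id", "value", "updated_at"],
    )

    # Rank the versions of each logical row, newest first, and keep rank 1.
    latest = (
        changelog
        .withColumn("rn", F.row_number().over(
            Window.partitionBy("id").orderBy(F.col("updated_at").desc())))
        .filter(F.col("rn") == 1)
        .drop("rn")
    )
    latest.show()  # one entry per id: (1, approved) and (2, submitted)

Note that a real changelog also has to represent deletions, which a naive dedupe like this wouldn't handle, so the snapshot dataset remains the safer input.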

Upvotes: 0
