Reputation: 366
I wanted to see whether there are more details about how job bookmarking works in AWS Glue. The AWS docs don't say much about it. I know that there is some basic functionality in there:
And it seems that the bookmark is recorded at the time of this call:
job.commit()
Can I access the bookmark? Can it be modified to reprocess some portion of the source?
Upvotes: 1
Views: 5053
Reputation: 329
Some additional info:
The basic tactic for a Job Bookmark design would be to save the START time of the last completed job. When the job is re-run, it processes only the files whose modification timestamp is newer than the START time of the previous run, which was bookmarked under the Transformation Context parameter.
However, the issue with this design is that under some conditions certain files would be incorrectly categorized as processed. For example, suppose a file is written to S3 with a timestamp just before the job starts, but because of the slight S3 consistency delay it is not yet visible to the job. The file is not processed in that run, and the Bookmark is updated when the job completes. On the next run the job skips the file, because its earlier timestamp makes it look as if it had already been processed.
The Bookmarks feature is thus designed to save not only the timestamp of the previous job start, but also a list of files in a certain band of uncertainty around that timestamp. This includes a threshold number of files within a time range before the timestamp. The next run then processes any file newer than that timestamp, plus any file inside the band of uncertainty that has not yet been processed.
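To make the idea concrete, here is a minimal pure-Python sketch of that selection logic. It is not Glue's actual implementation: the names `S3File`, `Bookmark`, `select_files` and `commit` are made up for illustration, the 15-minute band width is an arbitrary assumption, and the band is modeled as a time window only (no file-count threshold).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, List, Optional

# Assumed width of the uncertainty band; the real value is not stated in this thread.
UNCERTAINTY = timedelta(minutes=15)


@dataclass(frozen=True)
class S3File:
    key: str
    modified: datetime


@dataclass(frozen=True)
class Bookmark:
    last_start: Optional[datetime] = None        # START time of the last committed run
    seen_in_band: FrozenSet[str] = frozenset()   # keys already handled inside the band


def select_files(visible: List[S3File], bookmark: Bookmark) -> List[S3File]:
    """Pick files newer than the band start that were not already recorded."""
    if bookmark.last_start is None:
        return list(visible)  # first run: everything is new
    band_start = bookmark.last_start - UNCERTAINTY
    return [
        f for f in visible
        if f.modified >= band_start and f.key not in bookmark.seen_in_band
    ]


def commit(visible: List[S3File], run_start: datetime) -> Bookmark:
    """Save the run's START time plus the keys that fell inside the band."""
    band_start = run_start - UNCERTAINTY
    seen = frozenset(f.key for f in visible if f.modified >= band_start)
    return Bookmark(last_start=run_start, seen_in_band=seen)


def at(hour: int, minute: int, second: int = 0) -> datetime:
    return datetime(2024, 1, 1, hour, minute, second)


old = S3File("old.json", at(10, 0))
a = S3File("a.json", at(11, 50))
late = S3File("late.json", at(11, 59, 58))  # written before run 1, but not yet visible
b = S3File("b.json", at(12, 30))

# Run 1 starts at 12:00 and does not see late.json.
run1_visible = [old, a]
run1 = select_files(run1_visible, Bookmark())
bookmark = commit(run1_visible, at(12, 0))

# Run 2 starts at 13:00; late.json is now visible.
run2_visible = [old, a, late, b]
run2 = select_files(run2_visible, bookmark)

# A timestamp-only bookmark would miss late.json.
naive = [f for f in run2_visible if f.modified > bookmark.last_start]

print([f.key for f in run2])   # late.json is picked up together with b.json
print([f.key for f in naive])  # only b.json
```

The second run skips `a.json` because it was recorded in the band at commit time, but still picks up `late.json`, which a plain "newer than the last START time" check would drop.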
The Transformation Context (transformation_ctx) is the element through which the internal record of processed files is updated. The job.init command creates or loads a bookmark, while job.commit initializes and commits the bookmark.
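For reference, this is roughly where those three pieces sit in a PySpark Glue script. Treat it as a sketch: the bucket path, the JSON format and the `transformation_ctx` value `"source"` are placeholders, and the `run_job` wrapper is only there so that the Glue-only imports are not needed outside the Glue runtime. Bookmarks also have to be enabled on the job itself (the `--job-bookmark-option job-bookmark-enable` job parameter).

```python
import sys


def run_job(argv):
    # These modules only exist inside the Glue runtime, so they are imported here.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())

    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)  # creates or loads the bookmark

    # transformation_ctx ties this read to its own bookmark state.
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://example-bucket/input/"]},  # placeholder path
        format="json",
        transformation_ctx="source",
    )
    print("records read this run:", source.count())

    job.commit()  # the bookmark is persisted here


# Inside a Glue job, call this as the last line of the script:
# run_job(sys.argv)
```

If `job.commit()` is never reached, the bookmark is not advanced, so the next run should see the same input again.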
Hope that is helpful.
Upvotes: 4