Reputation: 1873
I am continuously receiving and storing multiple feeds of uncompressed JSON objects, partitioned by day, under different prefixes of an Amazon S3 bucket (hive-style: s3://bucket/object=<object>/year=<year>/month=<month>/day=<day>/object_001.json), and was planning to incrementally batch and load this data into a Parquet data lake using AWS Glue.
This design pattern & architecture seemed like a fairly safe approach, as it is backed by many AWS blog posts, here and there.
I have a crawler configured like so:
{
    "Name": "my-json-crawler",
    "Targets": {
        "CatalogTargets": [
            {
                "DatabaseName": "my-json-db",
                "Tables": [
                    "some-partitionned-json-in-s3-1",
                    "some-partitionned-json-in-s3-2",
                    ...
                ]
            }
        ]
    },
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG"
    },
    "Configuration": "{\"Version\":1.0,\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
}
And each table was "manually" initialized like so:
{
    "Name": "some-partitionned-json-in-s3-1",
    "DatabaseName": "my-json-db",
    "StorageDescriptor": {
        "Columns": [],  # I'd like the crawler to figure these out on its first crawl
        "Location": "s3://bucket/object=some-partitionned-json-in-s3-1/"
    },
    "PartitionKeys": [
        {
            "Name": "year",
            "Type": "string"
        },
        {
            "Name": "month",
            "Type": "string"
        },
        {
            "Name": "day",
            "Type": "string"
        }
    ],
    "TableType": "EXTERNAL_TABLE"
}
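For reference, a minimal boto3 sketch of that "manual" initialization could look like the following; the database/table names mirror the snippet above, everything else is a placeholder rather than my exact code:

import boto3

glue = boto3.client("glue")

# Sketch of the "manual" table initialization described above; names mirror
# the JSON snippet and should be adapted to your own setup.
glue.create_table(
    DatabaseName="my-json-db",
    TableInput={
        "Name": "some-partitionned-json-in-s3-1",
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": [
            {"Name": "year", "Type": "string"},
            {"Name": "month", "Type": "string"},
            {"Name": "day", "Type": "string"},
        ],
        "StorageDescriptor": {
            "Columns": [],  # left empty so that the first crawl fills in the schema
            "Location": "s3://bucket/object=some-partitionned-json-in-s3-1/",
        },
    },
)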
The first run of the crawler is, as expected, an hour-ish long, but it successfully figures out the table schema and existing partitions. Yet from that point onward, re-running the crawler takes the exact same amount of time as the first crawl, if not longer; which led me to believe that the crawler is not only crawling for new files / partitions, but re-crawling the entire S3 locations each time.
Note that the delta of new files between two crawls is very small (only a few new files are expected each time).
The AWS documentation suggests running multiple crawlers, but I am not convinced that this would solve my problem in the long run. I also considered updating the crawler's exclude patterns after each run, but then I would see little advantage in using crawlers over manually updating table partitions through some Lambda boto3 magic.
Am I missing something here? Maybe an option I have misunderstood regarding crawlers updating existing data catalogs rather than crawling data stores directly?
Any suggestions to improve my data cataloging? Indexing these JSON files in Glue tables is only necessary to me because I want my Glue Job to use bookmarking.
Thanks!
Upvotes: 5
Views: 10557
Reputation: 66
AWS Glue Crawlers now support Amazon S3 event notifications natively, to solve this exact problem.
See the blog post.
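If you take that route, the idea is to switch the crawler's recrawl policy to event mode and point it at an SQS queue that receives the S3 notifications. A rough boto3 sketch, shown with an S3 target for simplicity (the crawler name, path and queue ARN are placeholders; double-check the current API fields):

import boto3

glue = boto3.client("glue")

# Rough sketch: have the crawler consume S3 event notifications from an SQS
# queue so that it only visits changed prefixes instead of the whole location.
glue.update_crawler(
    Name="my-json-crawler",
    Targets={
        "S3Targets": [
            {
                "Path": "s3://bucket/object=some-partitionned-json-in-s3-1/",
                "EventQueueArn": "arn:aws:sqs:eu-west-1:123456789012:my-crawler-events",
            }
        ]
    },
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVENT_MODE"},
)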
Upvotes: 2
Reputation: 1873
Still getting some hits on this unanswered question of mine, so I wanted to share a solution I found adequate at the time: I ended up not using crawlers at all to incrementally update my Glue tables.
Using S3 Events / S3 API calls via CloudTrail / S3 EventBridge notifications (pick one), I ended up writing a Lambda which issues an ALTER TABLE ADD PARTITION
DDL query on Athena, updating an already existing Glue table with the newly created partition, based on the S3 key prefix. In my opinion this is a pretty straightforward and low-code approach to maintaining Glue tables; the only downside is handling service throttling (both Lambda and Athena) and failed queries, to avoid any loss of data in the process.
This solution scales up pretty well though, as the number of parallel DDL queries per account is a soft-limit quota that can be increased as your need to update more and more tables grows; and it works pretty well for non-time-critical workflows.
It works even better if you limit S3 writes to your Glue tables' S3 partitions (one file per Glue table partition is ideal in this particular implementation) by batching your data, using a Kinesis Data Firehose delivery stream for example.
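For what it's worth, the Lambda boils down to something along the following lines. This is only a sketch: the database name, query results location and key pattern are assumptions based on the layout described in the question, and retries / error handling are left out:

import re
import urllib.parse

import boto3

athena = boto3.client("athena")

# Hypothetical names -- adjust to your own setup.
DATABASE = "my-json-db"
QUERY_RESULTS = "s3://bucket/athena-query-results/"
KEY_PATTERN = re.compile(
    r"object=(?P<table>[^/]+)/year=(?P<year>\d+)/month=(?P<month>\d+)/day=(?P<day>\d+)/"
)


def handler(event, context):
    # One S3 notification can carry several records.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        match = KEY_PATTERN.search(key)
        if not match:
            continue  # not one of the partitioned objects we care about

        location = (
            f"s3://{bucket}/object={match['table']}/"
            f"year={match['year']}/month={match['month']}/day={match['day']}/"
        )
        ddl = (
            f"ALTER TABLE `{match['table']}` ADD IF NOT EXISTS "
            f"PARTITION (year='{match['year']}', month='{match['month']}', day='{match['day']}') "
            f"LOCATION '{location}'"
        )
        # Throttling (Lambda and Athena) and failed queries still need to be
        # handled somewhere (retries, DLQ, ...), as mentioned above.
        athena.start_query_execution(
            QueryString=ddl,
            QueryExecutionContext={"Database": DATABASE},
            ResultConfiguration={"OutputLocation": QUERY_RESULTS},
        )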
Upvotes: 0