Reputation: 125
I have a React app using Amplify with auth enabled. The app has many users, each of whom belongs to exactly one "client".
I would like to be able to limit access to the data in a Glue table to users that are members of the client, using IAM, so that I have a security layer as close to the data layer as possible.
I have a 'clientid' partition in the table. The table is backed by an S3 bucket, with each client's data stored in its own 'clientid=xxxxxx' folder. The table was created by a Glue job; the folders come from the following option passed to the "write_dynamic_frame" method at the end of the job.
{"partitionKeys": ["clientid"]},
My first idea was to use the clientid in the front-end to bake the user's client ID into the query to select just their partition but, clearly, that is open to abuse.
Then I tried running a Glue crawler over the existing table's S3 bucket, hoping it would create one table per folder if I unchecked the "Create a single schema for each S3 path" option. However, the crawler 'sees' the folders as partitions (presumably, at least in part, because of the Hive-style partitioning structure) and I just get a single table again.
There are tens of thousands of clients and terabytes of data, so moving or renaming data and manually creating tables is not feasible.
Please help!
Upvotes: 1
Views: 773
Reputation: 132972
I assume you already have a mechanism in place to assign an IAM role (individual or per client) to each user on the front end. If not, that's a big topic that should probably be its own question.
The most basic way to solve your problem is to make sure that the IAM roles only have s3:GetObject permission to the prefix of the partition(s) that the user is allowed to access. This would mean that users can only access their own data and will receive an error if they try accessing other users' data. They could potentially fish for what client IDs are valid, though, by trying different combinations and observing the difference between the query not hitting any partition (which would be allowed since no files would be accessed), and the query hitting a partition (which would not be allowed).
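As a rough illustration of that kind of policy, here is a minimal sketch that builds a policy document granting s3:GetObject on a single client's partition prefix. The bucket name, table prefix and client ID are made-up placeholders, and the helper name is hypothetical. Your roles will most likely need other permissions as well (for example to run the query at all), which are left out here.

```python
import json


def partition_read_policy(bucket: str, table_prefix: str, client_id: str) -> dict:
    """Build an IAM policy document allowing reads of one client's partition only.

    All argument values are assumptions for illustration; use your real
    bucket and the prefix where the table's data lives.
    """
    # Join the non-empty path parts so an empty table prefix does not
    # produce a double slash in the ARN.
    parts = [p for p in (table_prefix.strip("/"), f"clientid={client_id}") if p]
    resource = f"arn:aws:s3:::{bucket}/{'/'.join(parts)}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadOwnClientPartition",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [resource],
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values, not taken from the question.
    policy = partition_read_policy("example-data-bucket", "tables/events", "123456")
    print(json.dumps(policy, indent=2))
```

The resulting JSON could be attached to the per-client role as an inline policy; how the role gets assigned to the user is the separate topic mentioned above.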
I think it would be better to create tables, or even databases, per client. That would allow you to put permissions at the Glue Data Catalog level too, so that users cannot run queries at all against databases or tables other than their own. Unfortunately, Glue crawlers won't help you with that: they're too limited in what they can do, and will try to be helpful in unhelpful ways. You can create these tables easily with the Glue Data Catalog API, and you won't have to move any data. Just point each table's location at the location of the corresponding existing partition.
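A minimal sketch of the per-client table idea using boto3, assuming the existing table's storage descriptor (columns, formats, SerDe) can be reused as-is for each client. The helper names and the table naming scheme are hypothetical. The first function is pure, so you can inspect what would be created before calling AWS.

```python
import copy


def client_table_input(source_table: dict, client_id: str) -> dict:
    """Derive a Glue TableInput for one client from the existing table.

    `source_table` is the "Table" dict returned by Glue's get_table call.
    Only the location changes: it points at the client's existing
    partition folder, so no data is moved.
    """
    descriptor = copy.deepcopy(source_table["StorageDescriptor"])
    base = descriptor["Location"].rstrip("/")
    descriptor["Location"] = f"{base}/clientid={client_id}/"
    return {
        # Naming scheme is an assumption; pick whatever suits your catalog.
        "Name": f"{source_table['Name']}_client_{client_id}",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": dict(source_table.get("Parameters", {})),
        "StorageDescriptor": descriptor,
    }


def create_client_table(database: str, source_name: str, client_id: str) -> None:
    """Create the per-client table in the Glue Data Catalog."""
    import boto3  # imported here so the helper above works without AWS access

    glue = boto3.client("glue")
    source = glue.get_table(DatabaseName=database, Name=source_name)["Table"]
    glue.create_table(
        DatabaseName=database,
        TableInput=client_table_input(source, client_id),
    )
```

With tens of thousands of clients you would call `create_client_table` in a loop over the client IDs. The new tables have no partition keys, since each one covers exactly one 'clientid=...' folder. To use a database per client instead, pass a different `DatabaseName` to `create_table`.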
Upvotes: 1