Gunjan Khandelwal
Gunjan Khandelwal

Reputation: 189

Athena Best Practice to store query result

I am creating a Data Lake and have some tables in Glue Catalog that I need to query in Athena. As a prerequisite, Athena requires us to store the query results in a S3 bucket. I have "Temp" and "Logs" S3 buckets. But since this is client sensitive data, I just want to check should I create a new Athena bucket for this or use the existing temp/logs bucket.

Note: I dont have any future use of the Athena queries.

Upvotes: 1

Views: 3967

Answers (2)

Zerodf
Zerodf

Reputation: 2298

I would also add that Athena also keeps a history that might contain sensitive data such as PII, should that appear in your query.

Assuming the following data, DDL, and queries:

Data:

breed_id, breen_name, category
1,pug,toy
2,German Shepard, working,
3,Scottish Terrier, Working

DDL:

CREATE EXTERNAL TABLE default.dogs (
  `breed_id` int, 
  `breed_name` string, 
  `category` string
)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  LINES TERMINATED BY '\n' 
LOCATION
  's3://stack-exchange/48836509'
TBLPROPERTIES ('skip.header.line.count'='1')

Queries:

SELECT * FROM default.dogs WHERE breed_name = 'pug'
SELECT * FROM default.dogs WHERE breed_name = 'German Shepard'

We can see these in the console:

enter image description here

Based on these documentation, I believe this history is stored for 45 days.

Upvotes: 0

John Rotenstein
John Rotenstein

Reputation: 270224

That's a good point you make -- the output of the Amazon Athena queries will appear in the output files, including sensitive data.

You could create a bucket that only permits Write access -- that is, put a Deny policy on it so that nobody can GetObject from the bucket. That way, Athena is happy to write its output, but people can't see the results.

You could also apply an Amazon S3 lifecycle policy that deletes the files after one day.

An alternate method would be to trigger an AWS Lambda function as soon as the object is created and have the Lambda function delete the object.

Either way, ask people to direct their Athena output to that bucket if they don't need to access the results, or if there is sensitive data being retrieved.

Upvotes: 1

Related Questions