Reputation: 49
I need a system to read an S3 bucket for analysis. The bucket is broken down into Year/Month/Day/Hour folders, where each Hour folder contains a large number of zipped files amounting to over 2 GB.
Is this something I should be scripting in Python with boto3? Looking for any general direction.
Upvotes: 1
Views: 3984
Reputation: 270144
Amazon Athena can run SQL-like queries across multiple files stored in Amazon S3.
The files can be compressed with gzip. In fact, Athena will run faster and cheaper on compressed files because you are only charged for the amount of data scanned from disk.
All files in a given folder (path) in Amazon S3 must be in the same format. For example, if they are CSV files in gzip format, all the files must have the same number of columns in the same order.
You can then use a CREATE TABLE command in Amazon Athena, which defines the columns in the data files and the location of the data. This is the hardest part, because you have to get the format defined correctly.
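As a rough sketch only (the bucket name, table name, and column names below are placeholders, and the real column list must match your files exactly), the DDL for gzipped CSV data could look something like:

-- Placeholder names throughout; adjust the columns, delimiter and LOCATION to match your data.
-- Athena reads every file under the LOCATION prefix, including the
-- Year/Month/Day/Hour subfolders, and decompresses .gz files automatically.
CREATE EXTERNAL TABLE my_table (
  customer_id string,
  item_id     string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/';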
Then, you can run SQL SELECT commands to query the data, which will apply to all files in the designated folder.
In future, if you want to add or remove data, simply update the contents of the folder. The SELECT command always looks at the files in the folder at the time that the command is run.
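For example, using the placeholder table from the sketch above, you could re-run a simple count after uploading new .gz files, and the result will reflect whatever is in the folder at that moment:

-- Counts rows across whatever files sit under the table's LOCATION right now
SELECT COUNT(*) FROM my_table;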
Given your requirement of "count distinct values of a customer_id and group them by item_id across all files", it would be something like:
SELECT
  item_id,
  COUNT(DISTINCT customer_id)
FROM table
GROUP BY 1
Upvotes: 1