Reputation: 548
I have partitioned data stored in S3 using a Hive-style layout like this:
bucket/year=2017/month=3/date=1/filename.json
bucket/year=2017/month=3/date=2/filename1.json
bucket/year=2017/month=3/date=3/filename2.json
Every partition has around 1,000,000 records. I have created a table and its partitions in Athena for this data.
Now, running this query from Athena:
select count(*) from mts_data_1 where year='2017' and month='3' and date='1'
takes about 1,800 seconds to scan 1,000,000 records.
So my question is: how can I improve this query's performance?
Upvotes: 0
Views: 3010
Reputation: 11931
I think the problem is that Athena has to read so many files from S3: 250 MB isn't much data, but 1,000,000 files is a lot of files. Athena's query performance will improve dramatically if you reduce the number of files, and compressing the aggregated files will help some more. How many files do you really need for one day's partition? Even at one-minute resolution you would need fewer than 1,500, and since the current query takes ~30 minutes, you could easily start with far fewer than that.
There are many options for aggregating and compressing your records:
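One simple option is a small batch job that concatenates the many small JSON files for a partition into a single gzip-compressed, newline-delimited JSON file, which Athena's JSON SerDe reads natively. This is a minimal local sketch; the paths and function name are hypothetical, and in practice you would download a partition's objects from S3, aggregate them, and upload the single .json.gz back:

```python
import gzip
import json
from pathlib import Path

def aggregate_json_files(src_dir, dest_file):
    """Merge many small single-record JSON files into one
    gzip-compressed, newline-delimited JSON (NDJSON) file."""
    with gzip.open(dest_file, "wt", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).glob("*.json")):
            record = json.loads(path.read_text())
            # One JSON object per line -- the layout Athena expects
            # when it reads (optionally gzipped) JSON data.
            out.write(json.dumps(record) + "\n")

# Hypothetical usage for one day's partition:
# aggregate_json_files("year=2017/month=3/date=1", "date=1.json.gz")
```

After replacing the original objects with the aggregated file and re-running MSCK REPAIR TABLE (or re-adding the partition), the same query should scan one file instead of a million.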
Upvotes: 1