I am trying to refactor/optimize a part of a large architecture. The part I need advice on is how to work around Athena's limit of 100 partitions per INSERT INTO query.
I have a table, daily_aggregate, with partition columns customer, year, month, day, that sits on top of an S3 prefix like:
s3://my-bucket/daily/customer=<customer_id>/year/month/day/<files here>
The specific optimization requires me to create a cumulative version of this table, say sytd_aggregate (sytd = school year to date):
s3://my-bucket/sytd/customer=<customer_id>/year/month/day/<files here>
The details of the SQL query are not too important; basically it does the following bootstrapping:
sytd_aggregate(day=n) = sytd_aggregate(day=n-1) UNION daily_aggregate(day=n)
My query will run daily for all customers, so with something like ~2k customers I need around 20 repeated, batched queries like:
-- rough query skeleton to add the rollup for 2025-01-09
insert into sytd_aggregate
select *
from
(
    select * from daily_aggregate
    where customer IN <batch>
      and year = 2025
      and month = 1
      and day = 9
    UNION ALL
    select * from sytd_aggregate
    where customer IN <batch>
      and year = 2025
      and month = 1
      and day = 8
)
Then I just have to iterate over the batches dynamically (roughly the sketch below), and it's not that bad of a solution. However, I feel like there is probably a more canonical way to achieve this?
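For context, the batching driver I have in mind looks roughly like this. It's just a Python/boto3 sketch, not production code; the region, database, output location and customer ids are all placeholders:

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # placeholder region

# Placeholder customer ids; in reality these come from elsewhere in the pipeline.
customers = [f"cust_{i}" for i in range(2000)]

# Chunk into groups of <= 100 so each INSERT touches at most 100 partitions.
batches = [customers[i:i + 100] for i in range(0, len(customers), 100)]

INSERT_TEMPLATE = """
insert into sytd_aggregate
select * from (
    select * from daily_aggregate
    where customer IN ({batch}) and year = {y} and month = {m} and day = {d}
    UNION ALL
    select * from sytd_aggregate
    where customer IN ({batch}) and year = {y} and month = {m} and day = {d_prev}
)
"""

for batch in batches:
    in_list = ", ".join(f"'{c}'" for c in batch)
    query = INSERT_TEMPLATE.format(batch=in_list, y=2025, m=1, d=9, d_prev=8)
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "my_db"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},  # placeholder
    )
    # In the real job I poll get_query_execution until each batch finishes
    # before submitting the next one.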
I looked at some questions on this site, like this one and several other similar ones, as well as the officially recommended solution for a similar problem linked at the top, but they don't really suggest anything other than batching. Besides, my query runs every day, and on any given day the date partition is just one; it's the customer column that requires the ~2k partitions.
Another idea I had was to simply use UNLOAD, or even some direct copying of the underlying S3 data, and structure my job on a per-customer basis. I think that is a good solution in theory, but it would require me to significantly rework other parts of the pipeline.
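To make the UNLOAD idea concrete, a per-customer job might look something like the sketch below (Python/boto3 again; the Hive-style year=/month=/day= prefixes, database and output location are assumptions on my part, and the new day partition would still need to be registered on sytd_aggregate afterwards):

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # placeholder region

# One UNLOAD per customer writes the rollup files straight to that customer's
# sytd prefix, so no single query ever touches more than one partition.
UNLOAD_TEMPLATE = """
UNLOAD (
    -- in practice the select list would exclude the partition columns,
    -- since their values are already encoded in the destination path
    select * from daily_aggregate
    where customer = '{cust}' and year = 2025 and month = 1 and day = 9
    UNION ALL
    select * from sytd_aggregate
    where customer = '{cust}' and year = 2025 and month = 1 and day = 8
)
TO 's3://my-bucket/sytd/customer={cust}/year=2025/month=1/day=9/'
WITH (format = 'PARQUET')
"""

for cust in ["cust_1", "cust_2"]:  # placeholder ids; really ~2k customers
    athena.start_query_execution(
        QueryString=UNLOAD_TEMPLATE.format(cust=cust),
        QueryExecutionContext={"Database": "my_db"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},  # placeholder
    )
    # Afterwards the new day partition still has to be added to sytd_aggregate,
    # e.g. via ALTER TABLE sytd_aggregate ADD PARTITION or MSCK REPAIR TABLE.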