Reputation: 13192
I have a table in Amazon Athena (a Presto engine) of some user events that among others have userId (VARCHAR) and ts (timestamp, BIGINT) column. The table is quite big, some hundreds of millions of records. I would like to create a query that orders the events by the ts column. I quickly figured out that I can not use a global ORDER BY ts
because this would mean that all ts values should be stored in the memory of a single worker node and it causes an out-of-memory type error.
The point is that actually I do not need to sort these events globally, it would be enough if they were sorted for a single userid. I was trying to use also a compound query where the outer query gets all unique userid and the inner query use a WHERE userid = current_userid
clause but as I have around 50k distinct userid this query runs way too much time. (Same thing for JOIN
ing together the results of the query as described here). I am looking for some sql construct that uses sorts the rows only partially, something like ORDER BY ts OVER (PARTITION BY userid)
(this clause is not valid).
Upvotes: 1
Views: 360
Reputation: 20730
I quickly figured out that I can not use a global ORDER BY ts because this would mean that all ts values should be stored in the memory of a single worker node and it causes an out-of-memory type error
Presto supports distributed sort for over a year now (contributed by Starburst). Distributed sort removes the need to put all data in one node's memory and thus allows for linear scaling for sort operation. The more nodes you have, the more data you can sort, without any performance hit (no disk/storage is involved).
I don't think there is a way to force an older Presto versions to effectively do total ordering without putting all data in one node's memory. Thus, until Athena supports distributed sort natively, you cannot compensate on user side.
You can get latest Presto version from https://trino.io/download.html
Since you're on AWS, you can use Starburst Presto for AWS to have one-click (ok, actually "few clicks") deployment on Amazon. (I'm from Starburst).
Upvotes: 1
Reputation: 222462
You seem to be looking for:
ORDER BY RANK() OVER (PARTITION BY userid ORDER BY ts)
I am unsure, however, whether this will actually limit the memory consumption of you workers. You would need to test this out.
Upvotes: 0