MrTJ
MrTJ

Reputation: 13192

SQL partially order results

I have a table in Amazon Athena (a Presto engine) of some user events that among others have userId (VARCHAR) and ts (timestamp, BIGINT) column. The table is quite big, some hundreds of millions of records. I would like to create a query that orders the events by the ts column. I quickly figured out that I can not use a global ORDER BY ts because this would mean that all ts values should be stored in the memory of a single worker node and it causes an out-of-memory type error.

The point is that actually I do not need to sort these events globally, it would be enough if they were sorted for a single userid. I was trying to use also a compound query where the outer query gets all unique userid and the inner query use a WHERE userid = current_userid clause but as I have around 50k distinct userid this query runs way too much time. (Same thing for JOINing together the results of the query as described here). I am looking for some sql construct that uses sorts the rows only partially, something like ORDER BY ts OVER (PARTITION BY userid) (this clause is not valid).

Upvotes: 1

Views: 360

Answers (2)

Piotr Findeisen
Piotr Findeisen

Reputation: 20730

I quickly figured out that I can not use a global ORDER BY ts because this would mean that all ts values should be stored in the memory of a single worker node and it causes an out-of-memory type error

Presto supports distributed sort for over a year now (contributed by Starburst). Distributed sort removes the need to put all data in one node's memory and thus allows for linear scaling for sort operation. The more nodes you have, the more data you can sort, without any performance hit (no disk/storage is involved).

I don't think there is a way to force an older Presto versions to effectively do total ordering without putting all data in one node's memory. Thus, until Athena supports distributed sort natively, you cannot compensate on user side.

You can get latest Presto version from https://trino.io/download.html

Since you're on AWS, you can use Starburst Presto for AWS to have one-click (ok, actually "few clicks") deployment on Amazon. (I'm from Starburst).

Upvotes: 1

GMB
GMB

Reputation: 222462

You seem to be looking for:

ORDER BY RANK() OVER (PARTITION BY userid ORDER BY ts)

I am unsure, however, whether this will actually limit the memory consumption of you workers. You would need to test this out.

Upvotes: 0

Related Questions