AmirCS

Reputation: 331

Why does Dataflow not support SortByKey?

I was wondering why Dataflow does not support 'SortByKey' the way Apache Spark does.

I have a huge table in BigQuery that I cannot sort because "Order By" is not scalable. So I was thinking of moving the BigQuery output to Dataflow and sorting it there. But there is no SortByKey, and it seems I would have to write a combiner.

Any suggestions will be appreciated.

Upvotes: 1

Views: 446

Answers (1)

Ben Chambers

Reputation: 6130

Sorting (especially by key) requires globally serial processing, which is not a scalable operation. Apache Beam / Dataflow does not provide such support, as it is frequently unnecessary.

There are a variety of alternatives that generally address the need more scalably. For instance, you can sort the values within each key, which allows each key to be processed in parallel. Another common use case is TopN either globally or per-key. Again, this can be supported much more efficiently than actually sorting.
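To make the two alternatives above concrete, here is a minimal sketch in plain Python (standard library only, not the Beam API). The function names and the sample data are hypothetical and exist only for illustration. In the Beam Python SDK the corresponding building blocks would roughly be `GroupByKey` followed by a per-key sort, and the `Top` combiners, but check the SDK documentation for the exact usage.

```python
import heapq
from collections import defaultdict


def group_by_key(pairs):
    """Collect (key, value) pairs into a dict of key -> list of values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def sort_values_per_key(pairs):
    """Sort the values of each key independently.

    No ordering across keys is needed, so each key could be
    handled by a different worker in parallel.
    """
    return {key: sorted(values) for key, values in group_by_key(pairs).items()}


def top_n(values, n):
    """Return the n largest values (descending) without a full sort."""
    return heapq.nlargest(n, values)


def top_n_per_key(pairs, n):
    """Return the n largest values for each key."""
    return {key: heapq.nlargest(n, values) for key, values in group_by_key(pairs).items()}


# Hypothetical sample data.
pairs = [("a", 3), ("b", 7), ("a", 1), ("b", 2), ("a", 9)]

print(sort_values_per_key(pairs))
print(top_n([value for _, value in pairs], 2))
print(top_n_per_key(pairs, 2))
```

The point of the sketch is that neither operation needs a total order over the whole dataset: per-key sorting only orders values inside one key, and TopN only keeps a small bounded set of candidates.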

Could you elaborate on what you need to sort by and why? That would make it easier to identify options for implementing this with the Beam and Dataflow SDKs.

Upvotes: 1
