landau

Reputation: 41

Improve performance of processing billions-of-rows data in Spark SQL

In my corporate project, I need to cross join a dataset of over a billion rows with another of about a million rows using Spark SQL. Since a cross join is involved, I decided to divide the first dataset into several parts (each with about 250 million rows) and cross join each part with the million-row dataset, then combine the results with "union all".
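
Roughly what I am doing, simplified (table names and the slice count are placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("chunked-cross-join").getOrCreate()

    val big   = spark.table("big_table")    // ~1 billion rows (placeholder name)
    val small = spark.table("small_table")  // ~1 million rows (placeholder name)

    val numParts = 4
    // Tag each row of the large dataset with a slice id, cross join slice by slice,
    // then union the partial results back together.
    val sliced = big.withColumn("slice", monotonically_increasing_id() % numParts)

    val joined = (0 until numParts)
      .map(i => sliced.filter(col("slice") === i).drop("slice").crossJoin(small))
      .reduce(_ union _)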

Now I need to improve the performance of the join. I heard this can be done by partitioning the data and distributing the work across Spark workers. My questions are: how can partitioning be used effectively to improve performance, and what other ways are there to do this without partitioning?

Edit: filtering already included.

Upvotes: 1

Views: 5908

Answers (2)

Matus Danoczi

Reputation: 137

Well, in any scenario you will end up with tons of data. Be careful: avoid Cartesian joins on big data sets as much as possible, as they usually end in OOM exceptions.

Yes, partitioning can be the way to help you, because you need to distribute your workload from one node to more nodes, or even to the whole cluster. The default partitioning mechanism is a hash of the key, or the original partitioning key from the source (Spark takes this from the source directly). You first need to evaluate what your partitioning key is right now; afterwards you may find a better partitioning key or mechanism and repartition the data, thereby distributing the load. The join still has to be done either way, but it will run over more parallel sources.
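
For illustration, a rough sketch of what I mean by repartitioning before the join (table names, the partition count and the "key" column are placeholders you would tune to your own data and cluster):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("repartition-before-join").getOrCreate()

    val big   = spark.table("big_table")    // placeholder names
    val small = spark.table("small_table")

    // Round-robin repartition spreads the large side over more parallel tasks
    // before the cross join is executed.
    val crossed = big.repartition(800).crossJoin(small)

    // If your query has an equi-join key, repartition both sides by that key instead,
    // so the data is shuffled once and the load is distributed by key.
    val byKey = big.repartition(800, col("key"))
      .join(small.repartition(800, col("key")), Seq("key"))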

Upvotes: 1

khandnb

Reputation: 19

There should be some filters on your join query. You can use the filter attributes as the key to partition the data and then join based on the partitioned data.
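
For example, something like this (the column name and path are just placeholders): store the large dataset partitioned by the filter attribute, so the filter only reads the matching partitions before the join.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("partition-by-filter-attribute").getOrCreate()

    // Write the large dataset once, partitioned on the attribute your filters use.
    spark.table("big_table")
      .write
      .partitionBy("event_date")               // placeholder filter attribute
      .mode("overwrite")
      .parquet("/warehouse/big_table_by_date") // placeholder path

    // Later reads scan only the partitions the filter selects, shrinking the join input.
    val bigFiltered = spark.read
      .parquet("/warehouse/big_table_by_date")
      .filter(col("event_date") === "2019-01-01")

    val joined = bigFiltered.crossJoin(spark.table("small_table"))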

Upvotes: 0
