Reputation: 1
I have a very large dataset that I need to join in order to enrich it with additional columns.
Dataset A has the following structure:
origin|destination|segment1_origin|segment2_origin|segment3_origin|segment4_origin|segment5_origin|segment6_origin|segment1_destination|segment2_destination|segment3_destination|segment4_destination|segment5_destination|segment6_destination
and contains around 5 billion rows
Dataset B has the following structure:
origin|destination|stops|route
Dataset B holds information about every segment in dataset A and is almost 6 times the size of dataset A.
To enrich the stop and route details, what I'm doing right now is:
var enriched = DatasetA
for (x <- 1 to 6) {
  // Rename the enrichment columns so each join adds segment-specific columns.
  val segB = DatasetB.withColumnRenamed("stops", s"segment${x}_stops")
    .withColumnRenamed("route", s"segment${x}_route")
  enriched = enriched.join(segB, enriched(s"segment${x}_origin") === segB("origin")
    && enriched(s"segment${x}_destination") === segB("destination"), "left")
    .drop(segB("origin")).drop(segB("destination"))
}
This solution works, but my concern is that I'm joining 6 times. Is there any way to optimize this? The joins cause skew and the job slows down in the later stages.
Is there a better way to write this with Scala/Spark DataFrames?
Upvotes: 0
Views: 171
Reputation: 1307
You can do the join with spark.sql using an OR condition, so that you don't need to loop over the segments.
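A minimal sketch of what that single join could look like, assuming the DataFrames from the question are registered as temp views (the view names a and b are illustrative). Note that an OR-ed condition produces one output row per matching segment, so the result still has to be reshaped back into per-segment columns:

DatasetA.createOrReplaceTempView("a")
DatasetB.createOrReplaceTempView("b")

val joined = spark.sql("""
  SELECT a.*, b.stops, b.route
  FROM a
  LEFT JOIN b
    ON (a.segment1_origin = b.origin AND a.segment1_destination = b.destination)
    OR (a.segment2_origin = b.origin AND a.segment2_destination = b.destination)
    OR (a.segment3_origin = b.origin AND a.segment3_destination = b.destination)
    OR (a.segment4_origin = b.origin AND a.segment4_destination = b.destination)
    OR (a.segment5_origin = b.origin AND a.segment5_destination = b.destination)
    OR (a.segment6_origin = b.origin AND a.segment6_destination = b.destination)
""")

It is worth checking the physical plan afterwards: a purely OR-ed join condition is not an equi-join, so Spark may not be able to use a sort-merge or shuffle-hash join for it.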
Also, if your data is skewed and a few partitions are taking most of the time, you can try the salting technique. It increases the size of the data, but the joins won't be stuck for as long.
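A minimal sketch of salting applied to one of the segment joins; the salt column and the saltBuckets count are illustrative names, not from the original post. The idea is to add a random salt to the skewed side, replicate the other side once per salt value, and include the salt in the join key:

import org.apache.spark.sql.functions._

val saltBuckets = 32  // illustrative; tune to the skew seen in the Spark UI

// Random salt on the skewed side so hot keys spread across more partitions.
val saltedA = DatasetA.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Replicate the other side once per salt value so every salted key can still match.
val saltedB = DatasetB.withColumn("salt", explode(array((0 until saltBuckets).map(i => lit(i)): _*)))

val joined = saltedA.join(saltedB,
  saltedA("segment1_origin") === saltedB("origin") &&
  saltedA("segment1_destination") === saltedB("destination") &&
  saltedA("salt") === saltedB("salt"),
  "left").drop("salt")

The bucket count trades extra replication of DatasetB against better partition balance, which is why salting increases the overall data size.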
Upvotes: 1