Reputation: 3711
I'm wondering which of the two approaches will be more performant on a large dataset.
Let's say I've loaded orders from Mongo, and the schema for Orders
is
case class Orders(organization: String, orderId: Long, recipient: String)
val orders = MongoSpark.load[Orders](spark)
Now I see two ways of going about the next step: I'd like to look up the company attributed to each order.
Option 1 is a keyed lookup RDD (IndexedRDD)
val companies = MongoSpark.load[Company](spark, ReadConfig(...)).map { c => (c.id, c)}
val companiesMap = IndexedRDD(companies.rdd)
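(For clarity, with this option I would then resolve companies via point lookups against the index, something like the sketch below. It assumes the AMPLab spark-indexedrdd get/multiget API and that the company id key is a Long; both are assumptions on my part.)
import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._  // implicit key serializers for Long/String keys

// Point lookup against the index built above (hypothetical id value, for illustration).
val maybeCompany: Option[Company] = companiesMap.get(42L)

// Or resolve a batch of ids in one call.
val resolved: Map[Long, Company] = companiesMap.multiget(Array(42L, 43L, 44L))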
The second option would be to run a join:
val joined = orders.join(MongoSpark.load[Company](spark), $"orderId" === $"companyId")
This dataset on a production server ranges from 500 GB to 785 GB.
Upvotes: 2
Views: 706
Reputation: 8996
With the latest advances in Spark (>2.0), when it comes to RDD vs DataFrame the correct answer is DataFrames almost 100% of the time. I suggest you always try to stay in the DataFrame world and not drop down to RDDs at all.
In more detail: RDDs always carry all the fields of every row, they materialize the full Scala case class for each record, every String is a heavyweight Java String, and so on. DataFrames, on the other hand, with Tungsten (whole-stage code generation and its optimized encoders) and Catalyst, make everything faster.
RDDs are plain Scala/Java objects. DataFrames use their own very compact encoding for each type, which gives a much more compressed, cache-friendly representation of the same data.
RDD code doesn't go through Catalyst, meaning nothing will actually get (query) optimized.
Finally, DataFrames have a code generator that really optimizes the operations chained together within each stage.
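To make that concrete, here is a rough sketch of doing your lookup as a plain DataFrame/Dataset join, never leaving the DataFrame world. The collection name in the ReadConfig and the shape of Company are assumptions on my part, and the join condition simply mirrors the one in your question, so treat this as a sketch rather than a drop-in solution:
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

case class Orders(organization: String, orderId: Long, recipient: String)
case class Company(companyId: Long, name: String)   // hypothetical shape

val spark = SparkSession.builder().appName("orders-companies").getOrCreate()
import spark.implicits._

// Load both collections straight into Datasets; no RDD transition anywhere.
val orders    = MongoSpark.load[Orders](spark)
val companies = MongoSpark.load[Company](spark,
  ReadConfig(Map("collection" -> "companies"), Some(ReadConfig(spark.sparkContext))))

// Same join condition as in the question, expressed on the DataFrame API.
val joined = orders.join(companies, $"orderId" === $"companyId")

// Catalyst optimizes this plan and Tungsten generates the stage code;
// explain() shows what was actually produced.
joined.explain()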
This read is really a must.
Upvotes: 2