TheM00s3

Reputation: 3711

spark map RDD vs joins

Wondering which of the two will be more performant for a large dataset.

Let's say I've loaded orders from Mongo. The schema for Orders is:

case class Orders(organization: String, orderId: Long, recipient: String)

val orders = MongoSpark.load[Orders](spark)

Now I see that there are two ways of going about the next step: I'd like to look up the company attributed to each order.

Option 1 is a map RDD:

val companies = MongoSpark.load[Company](spark, ReadConfig(...)).map { c => (c.id, c)}
val companiesMap = IndexedRDD(companies.rdd)

Or the second option would be to run a join:

val joined = orders.join(MongoSpark.load[Company](spark), $"orderId" === $"companyId")

This dataset on a production server ranges from 500 GB to 785 GB.

Upvotes: 2

Views: 706

Answers (1)

marios

Reputation: 8996

With the latest advances in Spark (>2.0), when it comes to RDD vs DataFrame the correct answer is DataFrames almost 100% of the time. I suggest you always try to stay in the DataFrame world and not transition to RDDs at all.

In more detail: RDDs always carry all the fields of every row, and they materialize the full Scala case class, so all the Strings are heavyweight Java Strings, etc. On the other hand, DataFrames with Tungsten (whole-stage code generation and its optimized encoders) and Catalyst make everything faster.

  • RDDs are all Scala/Java objects. DataFrames use their own very thin encoding for types, which gives a much more compressed, cache-friendly representation of the same data.

  • RDD code doesn't go through Catalyst, meaning nothing will actually get (query) optimized.

  • Finally, DataFrames have a code generator that really optimizes chained operations across stages.
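
To make this concrete, here is a minimal sketch of how the orders-to-companies lookup could stay entirely in the DataFrame/Dataset world. The Company schema, the local test data, and the broadcast hint are assumptions for illustration; in the real job the two Datasets would come from MongoSpark.load as in the question.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical stand-ins for the question's Mongo-backed collections.
case class Orders(organization: String, orderId: Long, recipient: String)
case class Company(companyId: Long, name: String)

val spark = SparkSession.builder().appName("orders-companies-join").master("local[*]").getOrCreate()
import spark.implicits._

// In the real job these would come from MongoSpark.load[...]; tiny literal
// Datasets are used here only so the sketch runs on its own.
val orders    = Seq(Orders("acme", 1L, "alice")).toDS()
val companies = Seq(Company(1L, "Acme Corp")).toDS()

// A plain column-based join: Catalyst optimizes it and Tungsten keeps the
// rows in its compact binary format end to end.
val joined = orders.join(companies, $"orderId" === $"companyId")

// If companies is much smaller than orders, a broadcast hint avoids
// shuffling the large side at all.
val joinedBroadcast = orders.join(broadcast(companies), $"orderId" === $"companyId")

// explain() prints the physical plan, where the WholeStageCodegen stages
// and the chosen join strategy are visible.
joined.explain()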

This read is really a must.

Upvotes: 2
