Luong Ba Linh

Reputation: 802

DataFrame with groupBy versus RDD with reduceByKey

I have a csv file: (customerId, orderId, spend). I calculate the total spend for every customer using two approaches:

Approach 1: use DataFrame and groupBy

val df = ss.read
  .option("header", false)
  .option("inferSchema", true)
  .csv("data.csv")

df
  .groupBy("_c0")
  .sum("_c2")
  .collect()

Approach 2: use RDD and reduceByKey

sc
  .textFile("data.csv")
  .map(parseLine)
  .reduceByKey(_ + _)
  .collect()

private def parseLine(line: String) = {
  val fields = line.split(",")
  (fields(0).toInt, fields(2).toFloat)
}
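The per-key sum that reduceByKey performs can be sketched with plain Scala collections (no Spark required). The object name and the sample CSV lines below are made up for illustration; parseLine matches the function above:

```scala
// Plain-Scala sketch of approach 2's aggregation: key each order by
// customerId and sum the spend per customer, mirroring reduceByKey(_ + _).
object TotalSpend {
  def parseLine(line: String): (Int, Float) = {
    val fields = line.split(",")
    (fields(0).toInt, fields(2).toFloat)
  }

  def totalSpend(lines: Seq[String]): Map[Int, Float] =
    lines
      .map(parseLine)
      .groupBy(_._1)  // group (customerId, spend) pairs by customerId
      .map { case (customer, orders) => customer -> orders.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // Hypothetical rows in (customerId, orderId, spend) form.
    val lines = Seq("1,100,10.5", "2,200,3.0", "1,101,4.5")
    println(totalSpend(lines))
  }
}
```

This mirrors the semantics only; Spark's reduceByKey additionally combines values within each partition before shuffling.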

The results of the two approaches are the same. However, approach 2 is consistently faster (about 2x) than approach 1.

First question: is this because approach 1 uses groupBy? If so, how can that happen when I am running on my laptop, i.e. on a single node, i.e. with no shuffle cost?

This is my config for the Spark session:

.master("local[*]") 

Second question: how can I modify approach 1 to use DataFrame while retaining good performance like approach 2?

Thank you!

Upvotes: 2

Views: 1068

Answers (1)

Alper t. Turker

Reputation: 35229

Your first code scans the data twice:

  • Once to infer the schema.
  • Once to execute the aggregation.

Without any further information I'd attribute the slower execution to this fact. There are other differences as well, such as the cost of computing the execution plan.

Specific memory configuration, including size of the off-heap memory, can affect performance further.

how can I modify approach 1 to use DataFrame while retaining good performance like approach 2?

Provide a schema argument to the read method.
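A minimal sketch of what that could look like, assuming the three CSV columns are an Int customerId, an Int orderId, and a Float spend; the column names here are assumptions, not part of the original question:

```scala
import org.apache.spark.sql.types._

// Declaring the schema up front removes the inference pass,
// so the CSV file is scanned only once.
val schema = StructType(Seq(
  StructField("customerId", IntegerType, nullable = true),
  StructField("orderId",    IntegerType, nullable = true),
  StructField("spend",      FloatType,   nullable = true)
))

val df = ss.read
  .option("header", false)
  .schema(schema)          // replaces .option("inferSchema", true)
  .csv("data.csv")

df.groupBy("customerId")
  .sum("spend")
  .collect()
```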

how can it happen while I am running in my laptop. i.e. only 1 node, i.e. there is no shuffle cost?

Shuffle cost on a local machine might be lower due to local communication, but it is still a full shuffle, including disk IO, and it is still expensive.

is this because approach 1 is using groupBy

No. groupBy has nothing to do with it. Dataset.groupBy is not RDD.groupByKey.

Upvotes: 1
