Reputation: 802
I have a CSV file with columns (customerId, orderId, spend). I calculate the total spend for every customer using two approaches:
Approach 1: use DataFrame and groupBy
val df = ss.read
.option("header", false)
.option("inferSchema", true)
.csv("data.csv")
df
.groupBy("_c0")
.sum("_c2")
.collect()
Approach 2: use RDD and reduceByKey
sc
.textFile("data.csv")
.map(parseLine)
.reduceByKey(_ + _)
.collect()
private def parseLine(line: String) = {
val fields = line.split(",")
(fields(0).toInt, fields(2).toFloat)
}
The results of the two approaches are the same. However, approach 2 is consistently about 2x faster than approach 1.
First question: is this because approach 1 is using groupBy? If so, how can that happen when I am running on my laptop, i.e. only one node, so there should be no shuffle cost?
This is my config for the Spark session:
.master("local[*]")
Second question: how can I modify approach 1 to use a DataFrame while retaining performance comparable to approach 2?
Thank you!
Upvotes: 2
Views: 1068
Reputation: 35229
Your first code scans the data twice:
- once to infer the schema (inferSchema), and
- once to execute the aggregation.
Without any further information I'd attribute the slower execution to this fact. There are other differences, like the cost of computing the execution plan.
Specific memory configuration, including size of the off-heap memory, can affect performance further.
how can I modify approach 1 to use DataFrame while retaining good performance like approach 2?
Provide a schema argument for the read method, so the extra inference pass over the data is skipped.
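For example, a minimal sketch of supplying the schema explicitly (the column names here are assumptions, since the file has no header; ss is the question's SparkSession):
import org.apache.spark.sql.types._

// Declare the three columns up front instead of inferring them.
val schema = StructType(Seq(
  StructField("customerId", IntegerType, nullable = false),
  StructField("orderId", IntegerType, nullable = false),
  StructField("spend", FloatType, nullable = false)
))

val df = ss.read
  .option("header", false)
  .schema(schema) // no inference pass over the data
  .csv("data.csv")

df.groupBy("customerId")
  .sum("spend")
  .collect()
With an explicit schema the file is read only once, when the aggregation executes.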
how can it happen while I am running in my laptop. i.e. only 1 node, i.e. there is no shuffle cost?
Shuffle cost on a local machine might be lower due to local communication, but it is still a full shuffle, including disk IO, and it is still expensive.
is this because approach 1 is using groupBy
No. groupBy has nothing to do with it. Dataset.groupBy is not RDD.groupByKey: like reduceByKey, it performs partial (map-side) aggregation before the shuffle.
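You can confirm the partial aggregation by inspecting the physical plan, for instance with the question's auto-generated column names:
df.groupBy("_c0").sum("_c2").explain()
// The physical plan contains two HashAggregate nodes around the Exchange:
// a partial aggregation before the shuffle and a final merge after it,
// analogous to reduceByKey's map-side combine.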
Upvotes: 1