David

Reputation: 21

Why does an RDD count take so much time?

(English is not my first language so please excuse any mistakes)

I use SparkSQL to read 4.7 TB of data from a Hive table and perform a count on it, which takes about 1.6 hours. Reading the same data directly from the HDFS text file and counting takes only 10 minutes. Both jobs used the same resources and parallelism. Why does the RDD count take so much longer?

The Hive table has about 3,000 columns, so maybe serialization is costly. I checked the Spark UI: each task read about 240 MB of data and took about 3.6 minutes to execute. I can't believe serialization overhead alone is that expensive.

Reading from Hive (takes 1.6 hours):

val sql = s"SELECT * FROM xxxtable"
val hiveData = sqlContext.sql(sql).rdd
val count = hiveData.count()

Reading from HDFS (takes 10 minutes):

val inputPath = s"/path/to/above/hivetable"
val hdfsData = sc.textFile(inputPath)
val count = hdfsData.count()

Using a SQL COUNT, it still takes 5 minutes:

val sql = s"SELECT COUNT(*) FROM xxxtable"
val hiveData = sqlContext.sql(sql).rdd
hiveData.foreach(println(_))

Upvotes: 0

Views: 2369

Answers (2)

Jonathan Myers

Reputation: 930

Your first method is querying the data instead of fetching the data. Big difference.

val sql = s"SELECT * FROM xxxtable"
val hiveData = sqlContext.sql(sql).rdd

We can look at the above code as programmers and think "yes, this is how we grab all of the data". But the data is being grabbed via a query instead of being read directly from a file. Basically, the following steps occur:

  • Read from the files into temporary storage
  • The query engine processes the query against the temporary storage and produces results
  • The results are read into an RDD

There are a lot of steps there! Far more than what happens with the following:

val inputPath = s"/path/to/above/hivetable"
val hdfsData = sc.textFile(inputPath)

Here, we just have one step:

  • Read from file into RDD

See, that's a third of the steps. Even though it is a simple query, there is still a lot of overhead and processing involved in getting the data into that RDD. Once it's in the RDD, though, processing is easier, as shown by your code:

val count = hdfsData.count()
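If the goal is only the row count, a minimal sketch (assuming the same Spark 1.x `sqlContext` as in the question) that keeps the counting inside the query engine, so only a single number comes back to the driver instead of every row being converted to an RDD first:

```scala
// Let the SQL engine compute the count itself; only one row with one
// bigint column crosses back to the driver, and the engine does not
// have to materialize every column of every row as RDD records.
val count = sqlContext.sql("SELECT COUNT(*) FROM xxxtable").first().getLong(0)

// Equivalent DataFrame form, without writing SQL by hand:
val count2 = sqlContext.table("xxxtable").count()
```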

Upvotes: 2

sev7e0

Reputation: 106

Your first way loads all of the data into Spark, so the network transfer, serialization, and transform operations take a lot of time.

The second way is faster, I think, because it omits the Hive layer.

If you just need the count, the third way is better: only the count result is loaded after the count executes.
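One side note on the question's third snippet: `foreach(println(_))` on an RDD runs on the executors, so the printed count may end up in executor logs rather than the driver console. A sketch (assuming the same `sqlContext` as in the question) that brings the single result row back to the driver before printing:

```scala
// COUNT(*) yields one row containing one bigint column; fetch it to the
// driver with first() instead of printing from inside the executors.
val countRow = sqlContext.sql("SELECT COUNT(*) FROM xxxtable").first()
println(countRow.getLong(0))
```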

Upvotes: 0
