HISI

Reputation: 4797

Why is Spark very slow to run the same command?

I have a DataFrame that I've created from a Hive table as follows:

import org.apache.spark.sql.hive.HiveContext

val hiveObj = new HiveContext(sc) // sc is the SparkContext provided by spark-shell
import hiveObj.implicits._

val df = hiveObj.sql("select * from database.table")

The df here has about 2 million rows. So I created a sub-DataFrame subdf from the df above, limiting its number of rows to 500, as follows:

import org.apache.spark.sql.functions.rand

val subdf = df.orderBy(rand()).limit(500)

Now when I tried to display the df rows it took a couple of seconds, but when I tried to do the same thing with subdf it took literally more than 10 minutes, even though the number of rows is extremely small.

df.select("col").show()    // ~2 seconds
subdf.select("col").show() // more than 10 minutes

Could anyone explain what I'm doing wrong here?

Upvotes: 0

Views: 1307

Answers (1)

Alper t. Turker

Reputation: 35249

The reason is obvious when you think about the amount of work required to compute the result.

  • The first statement has to scan only the minimum number of partitions (possibly just one) to collect 20 records (the default number of rows returned by show).

  • The second statement has to:

    • Load all records.
    • Find the first 500 records in each partition, according to the ordering.
    • Shuffle the records to a single partition.
    • Get final 500 records.
    • Print 20 records.

The first scenario is almost as cheap as it gets; the cost of the second would be exceeded only by a full shuffle.
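If the goal is just a random subset rather than an exact, globally ordered sample, a cheaper pattern is to sample first and only then limit, which avoids the global sort and the shuffle to a single partition. A minimal sketch (not from the original post; the 0.001 fraction is an assumption sized for a ~2-million-row table, and sample keeps approximately, not exactly, that fraction):

import org.apache.spark.sql.functions.rand

// sample() scans each partition once and keeps a random fraction of rows,
// with no global sort and no shuffle to a single partition.
// 0.001 of ~2 million rows yields roughly 2000 candidates (assumed fraction).
val sampled = df.sample(withReplacement = false, fraction = 0.001)
val cheapSubdf = sampled.limit(500) // hypothetical name used only in this sketch

// Comparing the physical plans shows where the cost goes: the sorted version
// includes a global Sort plus an Exchange (or a TakeOrderedAndProject node,
// depending on the Spark version), while the sampled version does not.
df.orderBy(rand()).limit(500).explain()
cheapSubdf.explain()

Whether the planner collapses the sorted limit into a single step depends on the Spark version, so running explain() on your own cluster is the reliable way to confirm.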

Upvotes: 2
