HISI

Reputation: 4797

Why is Spark very slow to run the same command?

I have a DataFrame that I've created from a Hive table as follows:

import org.apache.spark.sql.hive.HiveContext

val hiveObj = new HiveContext(sc) // sc is the SparkContext provided by spark-shell
import hiveObj.implicits._

val df = hiveObj.sql("select * from database.table")

The df here has about 2 million rows. So I created a sub-DataFrame subdf from the df above, limiting its number of rows to 500, as follows:

import org.apache.spark.sql.functions.rand

val subdf = df.orderBy(rand()).limit(500)

Now when I tried to display the df rows it took a couple of seconds, but when I tried to do the same thing with subdf it took literally more than 10 minutes, even though the number of rows is extremely small.

df.select("col").show()    // ~2 seconds
subdf.select("col").show() // more than 10 minutes

Could anyone explain what I'm doing wrong here?

Upvotes: 0

Views: 1307

Answers (1)

Alper t. Turker

Reputation: 35249

The reason is obvious when you think about the amount of work required to compute the result.

  • The first statement has to scan only the minimum number of partitions (possibly just one) to collect 20 records (the default number of rows returned by show).

  • The second statement has to:

    • Load all records.
    • Find the first 500 records in each partition, according to the ordering.
    • Shuffle the records to a single partition.
    • Get final 500 records.
    • Print 20 records.

The first scenario is almost as cheap as it gets; the cost of the second would be exceeded only by a full shuffle.
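If the goal is just a random subset rather than an exact, globally ordered sample, a cheaper pattern is to sample first and only then limit, which avoids the global sort and the shuffle to a single partition. A minimal sketch (not from the original post; the 0.001 fraction is an assumption sized for a ~2-million-row table, and sample keeps approximately, not exactly, that fraction):

import org.apache.spark.sql.functions.rand

// sample() scans each partition once and keeps a random fraction of rows,
// with no global sort and no shuffle to a single partition.
// 0.001 of ~2 million rows yields roughly 2000 candidates (assumed fraction).
val sampled = df.sample(withReplacement = false, fraction = 0.001)
val cheapSubdf = sampled.limit(500) // hypothetical name used only in this sketch

// Comparing the physical plans shows where the cost goes: the sorted version
// includes a global Sort plus an Exchange (or a TakeOrderedAndProject node,
// depending on the Spark version), while the sampled version does not.
df.orderBy(rand()).limit(500).explain()
cheapSubdf.explain()

Whether the planner collapses the sorted limit into a single step depends on the Spark version, so running explain() on your own cluster is the reliable way to confirm.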

Upvotes: 2
