Reputation: 4797
I have a DataFrame that I've created from a Hive table as follows:
import org.apache.spark.sql.hive.HiveContext
import sqlContext.implicits._
val hiveObj = new HiveContext(sc)
val df = hiveObj.sql("select * from database.table")
The df here has about 2 million rows, so I created a sub-DataFrame subdf from it, limited to 500 rows, as follows:
import org.apache.spark.sql.functions.rand
val subdf = df.orderBy(rand()).limit(500)
Now, when I tried to display rows of df it took a couple of seconds, but when I tried to do the same thing with subdf it took literally more than 10 minutes, even though its number of rows is extremely small:
df.select("col").show() //2sec
subdf.select("col").show() //more than 10 min
Could anyone explain what I'm doing wrong here?
Upvotes: 0
Views: 1307
Reputation: 35249
The reason is obvious when you think about the amount of work required to compute the result.
The first statement has to check only the minimum number of partitions (possibly just one) to collect 20 records (the default number of rows returned by show).
The second statement has to:

- read all the data,
- compute a rand() value for every row,
- sort the rows by that random value,
- and only then take the first 500 records.
The first scenario is almost as cheap as it gets; the cost of the second would be exceeded only by a full shuffle.
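The difference can be sketched with a pure-Scala analogy (no Spark needed; the collection size and seed below are illustrative, not taken from the question): taking the first k elements can stop early, while ordering by a random key has to touch and sort every element first.

```scala
import scala.util.Random

// Stand-in for the 2-million-row DataFrame.
val data = (1 to 2000000).toVector

// Analogue of df.show(): just grab the first 20 elements -- O(k) work,
// no need to look at the rest of the collection.
val firstTwenty = data.take(20)

// Analogue of df.orderBy(rand()).limit(500): tag every element with a
// random key, sort the entire collection by it, then take 500 --
// O(n log n) work over all n elements before limit can do anything.
val rng = new Random(42)
val randomSample = data
  .map(x => (rng.nextDouble(), x))
  .sortBy(_._1)
  .take(500)
  .map(_._2)
```

On a cluster the gap is even wider, because the random sort also has to move data between executors, whereas show() on the plain df can be served from a handful of local partitions. If an approximate sample is acceptable, Spark's DataFrame.sample avoids the full sort entirely.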
Upvotes: 2