Reputation: 318
I created a Hive table from a 10 GB CSV file using Hue, then tried to run a SQL query against it. Processing the data takes a long time, more than 2 hours. Can anybody tell me whether this is a Spark problem, or whether I did something wrong?
I tried all the possible combinations, such as changing the number of executors, the executor cores, and the executor memory:
--driver-memory 10g \
--num-executors 10 \
--executor-memory 10g \
--executor-cores 10 \
I tested by changing num-executors to 10, 15, 20, 50, and 100, and did the same for the memory and cores.
As for the cluster, it has 6 nodes, 380+ cores, and 1 TB of memory.
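For completeness, a sketch of the same resource settings expressed as SparkConf keys rather than spark-submit flags (the values simply mirror the flags above; note that driver memory only takes effect if set before the driver JVM starts, so in practice it still has to be passed via spark-submit):

import org.apache.spark.SparkConf

// Same resources as the spark-submit flags above, as configuration keys.
val conf = new SparkConf()
  .setAppName("Spark Hive")
  .set("spark.driver.memory", "10g")      // --driver-memory (must be set before the driver JVM starts)
  .set("spark.executor.instances", "10")  // --num-executors
  .set("spark.executor.memory", "10g")    // --executor-memory
  .set("spark.executor.cores", "10")      // --executor-cores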
My SQL query:
select
  percentile_approx(x1, array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) as x1_quantiles,
  percentile_approx(x2, array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) as x2_quantiles,
  percentile_approx(x3, array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) as x3_quantiles
from mytest.test1
The code is pretty straightforward:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val query = args(0)
val sparkConf = new SparkConf().setAppName("Spark Hive")
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
// Marks the table for caching; the data is only materialized the first time the table is scanned,
// so the first query also pays for reading the full CSV.
sqlContext.cacheTable("mytest.test1")
val start = System.currentTimeMillis()
val testload = sqlContext.sql(query)
testload.show()
val end = System.currentTimeMillis()
println("Time took " + (end - start) + " ms")
Upvotes: 0
Views: 2340
Reputation: 330063
Well, it is not a Spark problem. Computing exact quantiles is an expensive process in a distributed environment because of the required sorting and the related shuffling. Since you compute percentiles on different columns, this process is repeated multiple times and can be particularly expensive if the variables are not strongly correlated. In general, you should avoid computing exact percentiles unless it is necessary.
Spark 2.0.0 implements a tunable method for quantile approximation, and if you're using an earlier version you can achieve a similar result by simple sampling. See How to find median using Spark.
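A minimal sketch of both routes, reusing the sqlContext and table from your question; the 0.1 sample fraction, the seed, and the 0.01 relative error are arbitrary illustration values:

val df = sqlContext.table("mytest.test1")

// Spark 2.0+ only: tunable approximation, the last argument is the acceptable relative error:
// val x1Quantiles = df.stat.approxQuantile("x1", Array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9), 0.01)

// Earlier versions: run the same aggregation on a small random sample instead.
val sample = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)
sample.registerTempTable("test1_sample")
val sampledQuantiles = sqlContext.sql(
  """SELECT
    |  percentile_approx(x1, array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) AS x1_quantiles,
    |  percentile_approx(x2, array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) AS x2_quantiles,
    |  percentile_approx(x3, array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) AS x3_quantiles
    |FROM test1_sample""".stripMargin)
sampledQuantiles.show()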
Upvotes: 3