Reputation: 12796
In my Spark job, I create a SparkContext, load my data from Parquet files, then use Spark SQL to process them. When I open a Spark shell, the first time I run a query it takes quite a long time, around 200 seconds in my case. If I then keep the Spark shell open and run the same query again, as well as some other queries on the same dataset, each takes only 20-30 seconds, roughly a 10x speedup.
Can someone give me a detailed explanation for this? Does one Spark shell keep using one SparkContext? If so, how exactly does reusing the SparkContext speed things up so much?
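For reference, the workflow looks roughly like this (the Parquet path, table name, and query are hypothetical stand-ins for mine):

```scala
// spark-shell session; `spark` is the SparkSession the shell provides.
val df = spark.read.parquet("hdfs:///data/events") // hypothetical path
df.createOrReplaceTempView("events")

// First run after opening the shell: ~200 s.
spark.sql("SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id").show()

// Re-running the same (or a similar) query in the same shell: ~20-30 s.
spark.sql("SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id").show()
```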
Upvotes: 3
Views: 1053
Reputation: 2448
The answer is that Spark keeps intermediate (shuffle) files on the executors' local disks for as long as the SparkContext is open. Once those intermediate files exist, stages that depend on the same shuffle output can read them from local disk instead of recomputing the data from HDFS.
You can verify this by opening the Spark UI and looking at the DAG visualization for your job. You'll see some of your stages are grey and some are blue. The grey ones should be labeled "skipped", which means the output for that stage was already available and did not need to be recomputed.
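As a rough way to observe this from the shell (the path and column name are hypothetical), run the same action twice on one Dataset reference and compare timings; the second run can reuse the shuffle output written during the first, as long as the context stays alive:

```scala
// Build one Dataset and keep the same reference, so the second action can
// reuse the shuffle files produced by the first.
val agg = spark.read.parquet("hdfs:///data/events") // hypothetical path
  .groupBy("user_id")
  .count()

spark.time { agg.collect() } // first run: full read from HDFS + shuffle
spark.time { agg.collect() } // second run: shuffle stages may show as "skipped" in the UI
```

Note that reuse depends on the same lineage being re-executed; a freshly parsed query that produces a new plan may not hit the existing shuffle files.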
Upvotes: 2