Reputation: 153
I'm a beginner to PySpark and I'm having trouble understanding how changes to executor memory affect job run time. I ran the code with different configurations (shared below) and noticed that when I reduce executor memory, the job takes less time. Could anyone please explain the reason behind this? I ran the following PySpark code:
from pyspark.sql import SparkSession, HiveContext

# Hive-enabled session; count the rows of the base table
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
hiveCtx = HiveContext(spark)
base_df = hiveCtx.sql("select * from base_table")
base_df.count()
Base table data : 7.4 GB
Configurations :
CASE 1
--driver-memory 4g --executor-memory 10g --executor-cores 10 --num-executors 2    (JOB TIME: 36 secs)
CASE 2
--driver-memory 4g --executor-memory 6g --executor-cores 10 --num-executors 2    (JOB TIME: 19 secs)
CASE 3
--driver-memory 4g --executor-memory 2g --executor-cores 10 --num-executors 2    (JOB TIME: 12 secs)
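For reference, here is a minimal sketch of how the job time could be measured from inside the script, assuming it is the wall-clock time of the count() action (the timing wrapper itself is not part of the original job):

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

start = time.time()
row_count = spark.sql("select * from base_table").count()  # same action as above
print("count = %d, elapsed = %.1f s" % (row_count, time.time() - start))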
Also, one more question: in CASE 3 I have executor-memory 2 GB and num-executors 2, i.e. a total of 4 GB, while my data is 7.4 GB (much more than the 4 GB I have allocated). How am I still getting better performance?
Upvotes: 0
Views: 348
Reputation: 21
The first thing I would check is whether the execution plan is exactly the same across the three configurations. The plan can differ when resources are allocated differently. To see the physical plan for each of the three configurations, chain the explain operator after the sql call: hiveCtx.sql("select * from base_table").explain(). The physical plan will be printed in the driver's log (a short sketch follows the example plan below).
The format of the plan looks something like this:
== Physical Plan ==
*(1) FileScan parquet default.src[key#10,value#11] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/home/spark-work/spark-warehouse/src], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,value:string>
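Picking up the suggestion above, a minimal sketch reusing hiveCtx and base_table from the question; explain(True) is the standard DataFrame call that additionally prints the parsed, analyzed and optimized plans:

plan_df = hiveCtx.sql("select * from base_table")
plan_df.explain()       # physical plan only
plan_df.explain(True)   # parsed, analyzed, optimized and physical plans

Compare the output of the three runs: if the plans differ, the timing difference is not purely about memory.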
Upvotes: 0
Reputation: 530
The first thing is memory allocation and deallocation.
Allocating and releasing executor memory takes time, depending on your resource manager, so a difference of around 15 seconds for the 20 GB configuration makes sense.
Is there any other concurrent task running on your cluster?
For CASE 3: your Hive table is partitioned, and Spark processes it partition by partition, so you never have the full dataset in memory at once. You will have roughly 2 * 10 partitions in memory at the same time (2 executors * 10 cores).
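As a quick check, a sketch using the standard getNumPartitions() call (2 * 10 is simply executors * cores from the configurations in the question):

base_df = hiveCtx.sql("select * from base_table")
num_partitions = base_df.rdd.getNumPartitions()
print("partitions: %d, concurrent tasks: %d" % (num_partitions, 2 * 10))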
If you want more information on the profile of your job, go to the Spark History UI; it gives you the amount of time spent on each task and a timeline of your Spark job. Check out the Spark monitoring documentation: https://spark.apache.org/docs/latest/monitoring.html
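If event logging is not enabled on your cluster yet, a sketch of the relevant settings (the log directory below is a placeholder, adjust it for your environment):

spark = (SparkSession.builder
         .enableHiveSupport()
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "hdfs:///spark-logs")  # placeholder path
         .getOrCreate())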
Upvotes: 1