Reputation: 760
I have two Linux machines, each with a different configuration:
Machine 1: 16 GB RAM, 4 Virtual Cores and 40 GB HDD (Master and Slave Machine)
Machine 2: 8 GB RAM, 2 Virtual Cores and 40 GB HDD (Slave machine)
I have set up a hadoop cluster between these two machines.
I am using Machine 1 as both master and slave.
And Machine 2 as slave.
I want to run my Spark application and utilise as many virtual cores and as much memory as possible, but I am unable to figure out which settings to use.
My spark code looks something like:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext, SQLContext, SparkSession

conf = SparkConf().setAppName("Simple Application")
sc = SparkContext('spark://master:7077', conf=conf)
hc = HiveContext(sc)
sqlContext = SQLContext(sc)
spark = SparkSession.builder.appName("SimpleApplication").master("yarn-cluster").getOrCreate()
So far, I have tried the following:
When I process my 2 GB file only on Machine 1 (in local mode, as a single-node cluster), it uses all 4 CPUs of the machine and completes in about 8 minutes.
When I process my 2 GB file with the cluster configuration above, it takes slightly longer than 8 minutes, though I expected it to take less time.
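For reference, a minimal sketch of how the two runs above would typically be selected; only the master URL changes, and the input path is a placeholder:

from pyspark.sql import SparkSession

# Local mode: a single JVM on Machine 1, using all 4 cores
spark = SparkSession.builder.appName("Simple Application").master("local[*]").getOrCreate()

# Standalone cluster mode: the same code, pointed at the standalone master instead
# spark = SparkSession.builder.appName("Simple Application").master("spark://master:7077").getOrCreate()

df = spark.read.text("input_2gb_file.txt")  # placeholder path
print(df.count())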
What number of executors and cores, and how much memory, do I need to set to maximize the usage of the cluster?
I have referred to the article below, but because the machines in my case have different configurations, I am not sure which parameters would fit best.
Apache Spark: The number of cores vs. the number of executors
Any help will be greatly appreciated.
Upvotes: 0
Views: 348
Reputation: 191701
When I process my 2 GB file with the cluster configuration above, it takes slightly longer than 8 minutes, though I expected it to take less time.
It's not clear where your file is stored.
I see you're using Spark Standalone mode, so I'll assume the file is not split on HDFS into about 16 blocks (given a block size of 128 MB).
In that scenario, your entire file will be processed in whole at least once, plus the overhead of shuffling that data across the network.
If you used YARN as the Spark master with HDFS as the FileSystem, and a splittable file format, then the computation would go "to the data", and you could expect quicker run times.
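As a hedged illustration of that setup (the HDFS path is a placeholder, and it assumes the file has already been copied to HDFS):

from pyspark.sql import SparkSession

# "yarn" as the master lets YARN schedule tasks on the nodes that hold the HDFS blocks.
spark = SparkSession.builder.appName("SimpleApplication").master("yarn").getOrCreate()

# A 2 GB file in a splittable format stored on HDFS is divided into roughly 16 blocks
# at the default 128 MB block size, so Spark gets roughly 16 input partitions.
df = spark.read.text("hdfs:///data/input_2gb_file.txt")  # placeholder path
print(df.rdd.getNumPartitions())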
As far as optimal settings go, there are trade-offs between cores and memory per executor and the number of executors, but there is no magic number for a particular workload, and you'll always be limited by the smallest node in the cluster. Keep in mind that the memory of the Spark driver and of other processes on the OS should be accounted for when calculating sizes.
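As a rough, non-authoritative starting point for the two machines described (4 cores / 16 GB and 2 cores / 8 GB), leaving a core and some memory on each node for the OS, the driver, and the Hadoop daemons, something like the sketch below could be tried and tuned from there; the numbers are assumptions, not a recommendation:

from pyspark.sql import SparkSession

# Illustrative only: 2 executors x 2 cores fits both nodes, and executor memory
# is capped by the smaller node (8 GB minus headroom for the OS, daemons, and overhead).
spark = (SparkSession.builder
         .appName("SimpleApplication")
         .master("yarn")
         .config("spark.executor.instances", "2")
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "5g")
         .getOrCreate())

Note that spark.driver.memory generally has to be set via spark-submit rather than in code, because the driver JVM has already started by the time the builder runs.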
Upvotes: 0