Reputation: 760
I have two Linux machines, each with a different configuration:
Machine 1: 16 GB RAM, 4 Virtual Cores and 40 GB HDD (Master and Slave Machine)
Machine 2: 8 GB RAM, 2 Virtual Cores and 40 GB HDD (Slave machine)
I have set up a hadoop cluster between these two machines.
I am using Machine 1 as both master and slave.
And Machine 2 as slave.
I want to run my Spark application and utilise as many virtual cores and as much memory as possible, but I am unable to figure out which settings to use.
My spark code looks something like:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext, SQLContext, SparkSession

conf = SparkConf().setAppName("Simple Application")
sc = SparkContext('spark://master:7077', conf=conf)
hc = HiveContext(sc)
sqlContext = SQLContext(sc)
spark = SparkSession.builder.appName("SimpleApplication").master("yarn-cluster").getOrCreate()
So far, I have tried the following:
When I process my 2 GB file only on Machine 1 (in local mode, as a single-node cluster), it uses all 4 CPUs of the machine and completes in about 8 minutes.
When I process my 2 GB file with the cluster configuration above, it takes slightly longer than 8 minutes, though I expected it to take less time.
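For reference, a minimal sketch of how the two runs above would typically be selected; only the master URL changes, and the input path is a placeholder:

from pyspark.sql import SparkSession

# Local mode: a single JVM on Machine 1, using all 4 cores
spark = SparkSession.builder.appName("Simple Application").master("local[*]").getOrCreate()

# Standalone cluster mode: the same code, pointed at the standalone master instead
# spark = SparkSession.builder.appName("Simple Application").master("spark://master:7077").getOrCreate()

df = spark.read.text("input_2gb_file.txt")  # placeholder path
print(df.count())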
What number of executors and cores, and how much memory, do I need to set to maximize the usage of the cluster?
I have referred to the article below, but because the machines in my case have different configurations, I am not sure which parameters would fit best.
Apache Spark: The number of cores vs. the number of executors
Any help will be greatly appreciated.
Upvotes: 0
Views: 348
Reputation: 191701
When I process my 2 GB file with the cluster configuration above, it takes slightly longer than 8 minutes, though I expected it to take less time.
It's not clear where your file is stored.
I see you're using Spark Standalone mode, so I'll assume the file is not split on HDFS into about 16 blocks (given a block size of 128 MB).
In that scenario, your entire file will be processed in whole at least once, plus the overhead of shuffling that data across the network.
If you used YARN as the Spark master with HDFS as the FileSystem, and a splittable file format, then the computation would go "to the data", and you could expect quicker run times.
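As a hedged illustration of that setup (the HDFS path is a placeholder, and it assumes the file has already been copied to HDFS):

from pyspark.sql import SparkSession

# "yarn" as the master lets YARN schedule tasks on the nodes that hold the HDFS blocks.
spark = SparkSession.builder.appName("SimpleApplication").master("yarn").getOrCreate()

# A 2 GB file in a splittable format stored on HDFS is divided into roughly 16 blocks
# at the default 128 MB block size, so Spark gets roughly 16 input partitions.
df = spark.read.text("hdfs:///data/input_2gb_file.txt")  # placeholder path
print(df.rdd.getNumPartitions())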
As far as optimal settings go, there are trade-offs between cores and memory per executor and the number of executors, but there is no magic number for a particular workload, and you'll always be limited by the smallest node in the cluster. Keep in mind that the memory of the Spark driver and of other processes on the OS should be accounted for when calculating sizes.
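As a rough, non-authoritative starting point for the two machines described (4 cores / 16 GB and 2 cores / 8 GB), leaving a core and some memory on each node for the OS, the driver, and the Hadoop daemons, something like the sketch below could be tried and tuned from there; the numbers are assumptions, not a recommendation:

from pyspark.sql import SparkSession

# Illustrative only: 2 executors x 2 cores fits both nodes, and executor memory
# is capped by the smaller node (8 GB minus headroom for the OS, daemons, and overhead).
spark = (SparkSession.builder
         .appName("SimpleApplication")
         .master("yarn")
         .config("spark.executor.instances", "2")
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "5g")
         .getOrCreate())

Note that spark.driver.memory generally has to be set via spark-submit rather than in code, because the driver JVM has already started by the time the builder runs.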
Upvotes: 0