Yohan Chung

Reputation: 539

Improving SQL Query using Spark Multi Clusters

I was experimenting to see whether Spark with multiple workers can speed up a slow SQL query. I created two workers for the master, all running on a local Spark Standalone cluster, and yes, I halved the memory and the number of cores so that both workers fit on the local machine. I specified partitioning options for sqlContext (partitionColumn, lowerBound, upperBound and numPartitions) so that tasks (or partitions) could be distributed over the workers. The options are shown below (partitionColumn is unique):

# `query` must be a table name or a parenthesized subquery with an alias,
# e.g. "(SELECT ...) AS q"
df = sqlContext.read.format("jdbc").options(
    url = "jdbc:sqlserver://localhost;databasename=AdventureWorks2014;integratedSecurity=true;",
    driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    dbtable = query,
    partitionColumn = "RowId",
    lowerBound = "1",          # bound/partition options passed as strings;
    upperBound = "10000000",   # older PySpark versions don't convert numbers
    numPartitions = "4").load()
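
For reference, these options make Spark split the read into numPartitions parallel JDBC queries, each covering a stride of roughly (upperBound - lowerBound) / numPartitions values of RowId, with the first and last partitions left open-ended. A quick way to confirm how the load was split (a sketch; note that it fetches the full result set):

# Number of partition queries the JDBC source created
print(df.rdd.getNumPartitions())          # expected: 4

# Rows fetched by each partition; heavily skewed counts would mean
# lowerBound/upperBound don't match the actual RowId range
print(df.rdd.glom().map(len).collect())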

I ran my script against the master after specifying these options, but I got no performance improvement compared with running Spark without the cluster. I know I should not have halved the memory, for the integrity of the experiment, but I would like to know whether that alone explains the result or, if not, what else does. Any thoughts are welcome. Many thanks.

Upvotes: 2

Views: 674

Answers (1)

zero323

Reputation: 330353

There are multiple factors at play here, though the weight of each can differ on a case-by-case basis.

  • As nicely pointed out by mtoto, increasing the number of workers on a single machine is unlikely to bring any performance gains.

    Multiple workers on a single machine have access to the same fixed pool of resources. Since a worker doesn't participate in the processing itself, you just spend a higher fraction of this pool on management.

    There are legitimate cases where we prefer a higher number of executor JVMs, but that is not the same as increasing the number of workers (the former is an application resource, the latter a cluster resource); see the sketch after this list.

  • It is not clear whether you used the same number of cores for the baseline and the multi-worker configuration; in any case, cores are not the only resource you have to consider when working with Spark. Typical Spark jobs are IO-bound (mostly network and disk). Increasing the number of threads on a single node, without making sure there is sufficient disk and network throughput, will just make them wait for the data.

    Increasing cores alone is useful only for jobs which are CPU-bound (and those will typically scale better on a single machine).

  • Fiddling with Spark resources won't help if an external resource cannot keep up with the requests. A high number of concurrent batch reads from a single non-replicated database will just throttle the server.

    In this particular case you make it even worse by running the database server on the same node as Spark. That has some advantages (all traffic can go through loopback), but unless the database and Spark use different sets of disks, they'll be competing over disk IO (and other resources as well).
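
To illustrate the worker-versus-executor point above: on a standalone cluster you can get multiple executor JVMs out of a single worker by capping spark.executor.cores below the worker's core count, rather than starting extra workers. A minimal sketch, assuming an 8-core machine and a recent Spark version (the numbers are placeholders):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
    .setMaster("spark://localhost:7077")
    .set("spark.executor.cores", "2")    # cores per executor JVM
    .set("spark.executor.memory", "2g")  # memory per executor JVM
    .set("spark.cores.max", "8"))        # total cores for this application

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

With these settings the single worker can launch up to four 2-core executors for the application; note this is an application-level knob, not a change to the cluster itself.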

Note:

It is not clear what the query is, but if it is slow when executed directly against the database, fetching it through Spark will make it even slower. You should probably take a closer look at the query and/or the database structure and configuration first.
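
On that note, dbtable doesn't have to be a bare table name: it also accepts a parenthesized subquery with an alias, so filtering and aggregation can run inside SQL Server before anything is fetched. A minimal sketch (the table and column names are only illustrative):

# Push the selective part of the query down to the database;
# Spark then fetches only the already-reduced result.
pushdown = "(SELECT SalesOrderDetailID AS RowId, OrderQty FROM Sales.SalesOrderDetail WHERE OrderQty > 1) AS q"

df = sqlContext.read.format("jdbc").options(
    url = "jdbc:sqlserver://localhost;databasename=AdventureWorks2014;integratedSecurity=true;",
    driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    dbtable = pushdown).load()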

Upvotes: 3
