Reputation: 81
Is there a way to ensure that my Spark executors are co-located with my HBase region servers? In the Spark-on-HBase connector from Hortonworks it is mentioned as below:
We assume Spark and HBase are deployed in the same cluster, and Spark executors are co-located with region servers
Is there a way of achieving the same? If I use
sparkContext.newAPIHadoopRDD()
will it ensure data locality?
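For reference, this is roughly what I am trying (a minimal sketch; `my_table` is just a placeholder):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")  // placeholder table name

// Scan the table as an RDD of (row key, Result) pairs via the HBase input format.
val hbaseRdd = sc.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// The preferred locations can be inspected per partition:
// hbaseRdd.partitions.foreach(p => println(hbaseRdd.preferredLocations(p)))
```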
Upvotes: 3
Views: 696
Reputation: 528
Our experience at Splice Machine was that for large queries, and for systems where many analytical queries and compactions were running in Spark, we would achieve decent locality (95%+). We run Spark on YARN, where the executors are dynamically allocated and shrink after a certain period of inactivity. There were a couple of issues we had to work around, however.
Single Spark context across operations. We built a single Spark context server so that all of our queries could have effective resource management and locality. If you create many Spark contexts, executor resources can be blocked from executing on certain nodes.
If someone runs a medium-sized query after a period of inactivity, there is a strong likelihood that not all of the nodes where the data resides will have executors dynamically allocated.
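The knobs to tune here are Spark's standard dynamic-allocation and locality-wait settings; the values below are illustrative only, not what we actually run:

```scala
import org.apache.spark.SparkConf

// Illustrative settings only. Keeping a floor of executors and waiting a bit longer
// for a node-local slot softens the "no executor on the data node" problem, at the
// cost of held resources and slower task launch.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")             // required for dynamic allocation on YARN
  .set("spark.dynamicAllocation.minExecutors", "4")         // hypothetical floor
  .set("spark.dynamicAllocation.executorIdleTimeout", "300s")
  .set("spark.locality.wait", "6s")                         // how long to wait for a node-local slot
```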
We leaned heavily on our implementation of compactions on Spark and on an approach of reading store files directly from Spark (vs. an HBase remote scan), with incremental deltas applied from the HBase memstore. The compactions created better locality and demand on the executor tasks. Reading the store files directly allowed locality to be based on file locality (2-3 HDFS copies) vs. being local to the single region server (only 1 server).
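Our store-file reader is part of the Splice engine, but HBase's stock TableSnapshotInputFormat illustrates the same idea of bypassing the region server and reading HFiles directly (the snapshot name and restore directory below are placeholders):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
import org.apache.hadoop.mapreduce.Job

// NOT our implementation (ours also merges incremental deltas from the memstore),
// but the locality characteristics are similar: splits follow the HFile block
// locations, i.e. the 2-3 HDFS replicas, instead of the single region server.
val conf = HBaseConfiguration.create()
val job = Job.getInstance(conf)
TableSnapshotInputFormat.setInput(job, "my_snapshot", new Path("/tmp/snapshot_restore"))

val storeFileRdd = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TableSnapshotInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
```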
We wrote our own split mechanism because the default HBase split by region size caused long tails and serious memory issues. For example, we would have a table with 6 regions ranging from 20 MB to 4 GB. The 4 GB region would be the long tail. Certain Spark operations would expect an executor to be able to load the entire 4 GB into memory, causing memory issues. With our own split mechanism, we in essence put a bound on the amount of data to scan and put in memory.
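Purely to illustrate the idea (this is not our production split code), you can cap the bytes handled by any one Spark task by cutting an oversized region's key range into sub-ranges, e.g. with HBase's `Bytes.split` utility:

```scala
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical sketch: bound the data scanned by any one task to maxBytesPerSplit.
// Edge cases (empty first/last region keys, ranges Bytes.split cannot divide) are ignored.
case class RegionInfo(startKey: Array[Byte], endKey: Array[Byte], sizeBytes: Long)
case class ScanRange(startKey: Array[Byte], endKey: Array[Byte])

val maxBytesPerSplit = 256L * 1024 * 1024  // hypothetical 256 MB bound

def boundedSplits(region: RegionInfo): Seq[ScanRange] = {
  val pieces = math.max(1, math.ceil(region.sizeBytes.toDouble / maxBytesPerSplit).toInt)
  if (pieces == 1) {
    Seq(ScanRange(region.startKey, region.endKey))
  } else {
    // Bytes.split interpolates intermediate row keys; it returns pieces + 1 boundaries,
    // including the start and end keys themselves.
    val bounds = Bytes.split(region.startKey, region.endKey, pieces - 1)
    bounds.sliding(2).map(b => ScanRange(b(0), b(1))).toSeq
  }
}
```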
If you need more detail about what we did, check out
http://community.splicemachine.com/
We are open source and you can check out our code at
https://github.com/splicemachine/spliceengine
Good luck Vidya...
Cheers, John
Upvotes: 1