Bogdan N

Reputation: 771

Apache Spark - passing configuration to nodes

I have an application using Apache Spark 1.4.1 on a standalone cluster. The code of this application has evolved and is now quite complex (far more than the few lines of code we see in most Apache Spark examples), with lots of method calls from one class to another.

I am trying to add code so that, when a problem with the data is encountered (while processing it on the cluster nodes), an external application is notified. The connection details for contacting that external application are kept in a config file. I want to somehow pass the connection details to the cluster nodes, but passing them into every method that runs on the nodes (as parameters or as a broadcast variable) is not acceptable for my application: it would mean each and every method has to pass them along, and we have lots of chained method calls (method A calls B, B calls C, ..., Y calls Z), unlike most Apache Spark examples, where there are only one or two method calls.

I am trying to work around this problem: is there a way to pass data to the nodes besides method parameters and broadcast variables? For example, I was looking at setting an env property that points to the config file (using System.setProperty) on all nodes, so that I can read the connection details on the fly and keep that code isolated in a single block, but I've had no luck so far.

Upvotes: 1

Views: 2953

Answers (3)

nojka_kruva

Reputation: 1454

I suggest the following solution:

  1. Put the configuration in a database.
  2. Put the database connection details in a JOCL (Java Object Configuration Language) file and make this file available on the classpath of each executor.
  3. Make a singleton class that reads the DB connection details from the JOCL file, connects to the database, extracts the configuration info and exposes it through getter methods (a rough sketch is shown below).
  4. Import the class into the context where you make your Spark calls and use it to access the configuration from within them.
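
A rough sketch of what step 3 could look like. This is only an illustration under assumptions: the class name AppConfig, the resource name /db-connection.properties (standing in for the JOCL file and parsed here with plain java.util.Properties), and the app_config table layout are all made up; adapt them to your setup. Because the singleton is initialised lazily, each executor JVM builds it on first use, so deeply nested methods can just call AppConfig.get() instead of having the details threaded through every call:

import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public final class AppConfig {

    private static volatile AppConfig instance;
    private final Map<String, String> values = new HashMap<String, String>();

    private AppConfig() {
        try (InputStream in = AppConfig.class.getResourceAsStream("/db-connection.properties")) {
            // DB connection details, available on every executor's classpath
            Properties db = new Properties();
            db.load(in);
            try (Connection conn = DriverManager.getConnection(
                         db.getProperty("jdbc.url"),
                         db.getProperty("jdbc.user"),
                         db.getProperty("jdbc.password"));
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT conf_key, conf_value FROM app_config")) {
                while (rs.next()) {
                    values.put(rs.getString("conf_key"), rs.getString("conf_value"));
                }
            }
        } catch (Exception e) {
            throw new RuntimeException("Could not load application configuration", e);
        }
    }

    // Lazily initialised once per JVM (driver or executor).
    public static AppConfig get() {
        if (instance == null) {
            synchronized (AppConfig.class) {
                if (instance == null) {
                    instance = new AppConfig();
                }
            }
        }
        return instance;
    }

    // Example getter; add one per configuration entry you need on the nodes.
    public String getNotificationUrl() {
        return values.get("notification.url");
    }
}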

Upvotes: -1

saravanan

Reputation: 11

The properties you provide via --properties-file are loaded at runtime and are available only on the driver, not on any of the executors. But you can always make them available to the executors yourself.

Simple hack:

// Reads a property from SparkConf (populated from --properties-file on the driver),
// falls back to the process environment, and mirrors the value to the executors
// via setExecutorEnv so that it becomes visible there as an environment variable.
private static String getPropertyString(String key, Boolean mandatory) {
    String value = sparkConf.get(key, null);       // present on the driver via --properties-file
    if (mandatory && value == null) {
        value = sparkConf.getenv(key);             // maybe we are already on an executor
        if (value == null) {
            shutDown(key);                         // or whatever action you would like to take
        }
    }
    if (value != null && sparkConf.getenv(key) == null) {
        sparkConf.setExecutorEnv(key, value);      // propagate to the executor environments
    }
    return value;
}

When your driver starts, the first call will find each property supplied through the properties file in SparkConf. As soon as it finds one, it checks whether that key is already present in the environment; if not, it sets the value for the executors using setExecutorEnv. It is hard to tell whether your code is running on the driver or on an executor, so check whether the property exists in SparkConf, and if it doesn't, check it against the environment using getenv(key).
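
A small usage sketch of this hack, assuming the helper above lives in the driver class, sparkConf is the conf used to create the context, rdd is a JavaRDD from the surrounding job, and MYAPP_NOTIFY_URL is a made-up key present in the properties file:

// Driver side, before any action runs: copies the value into the executor environment.
String notifyUrl = getPropertyString("MYAPP_NOTIFY_URL", true);

// Executor side, inside a task: the value is read back from the process environment.
rdd.foreach(record -> {
    String url = System.getenv("MYAPP_NOTIFY_URL");
    // ... notify the external application at 'url' when this record is problematic
});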

Upvotes: 1

Bogdan N

Reputation: 771

Actually, after some hours of investigation I found a way that really suits my needs. There are two Spark properties (one for the driver, one for the executors) that can be used to pass parameters, which can then be read using System.getProperty():

  • spark.executor.extraJavaOptions
  • spark.driver.extraJavaOptions

Using them is simpler than the approach suggested in the post above, and you can easily make your application switch configuration from one environment to another (e.g. QA/DEV vs PROD) when you have all the environment setups in your project. They can be set on the SparkConf object when you initialize the SparkContext.
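
A minimal sketch of how this can look, assuming a hypothetical system property name config.file and example config paths (both are my own illustration, not part of the original setup):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MyAppDriver {
    public static void main(String[] args) {
        // e.g. args[0] is "qa" or "prod"; the flag name and paths are illustrative.
        String javaOpts = "-Dconfig.file=/etc/myapp/" + args[0] + ".conf";

        SparkConf conf = new SparkConf()
                .setAppName("my-app")
                // The same system property for the driver and for every executor JVM,
                // so System.getProperty("config.file") works on both sides.
                .set("spark.driver.extraJavaOptions", javaOpts)
                .set("spark.executor.extraJavaOptions", javaOpts);

        JavaSparkContext sc = new JavaSparkContext(conf);

        // Later, in any method that ends up running inside an executor task:
        // String configPath = System.getProperty("config.file");
        // ...read the external application's connection details from that file...

        sc.stop();
    }
}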

The post that helped me a lot in figuring out the solution is: http://progexc.blogspot.co.uk/2014/12/spark-configuration-mess-solved.html

Upvotes: 3
