S. Gill

Reputation: 96

Bluemix Apache Spark service with spark-submit. Uploading Data to Object Storage

I have been using spark-submit to run some simple jobs on the Bluemix spark service (Word Count, SparkPi). Both run fine. I used a small text file to test Word Count through spark-submit.sh (uploaded the file using --files). However, when I used a large file instead, the job did not run. I looked at the logs and saw the message "413 Request Entity Too Large".

I assume this means that the file is too large to submit. So I have 3 questions.

  1. Can I increase the limit on the size of the file I am allowed to upload through spark-submit?

  2. Can I link my application to the existing swift object storage and just upload my large file there?

On question 2, I have done some initial research, and it seems I need to add credentials to the request to access object storage. So one additional question.

  3. Is there any way I can incorporate these credentials without altering the application source code? (For example, by passing the credentials to spark-submit the way vcap.json is?)

I appreciate your time. I would not ask if I did not need to.

Upvotes: 0

Views: 429

Answers (1)

Randy Horman

Reputation: 430

re: "413 Request Entity Too Large"

The Bluemix Apache Spark service is a compute service only, meaning your data should reside in a storage service, such as the Bluemix Object Storage service, Cloudant, S3, or whatever makes sense. Your spark-submit program then connects to that storage service, creates RDDs over the data, and goes to town. In your case, you are trying to pass in the data you want to analyze via the --files parameter of spark-submit, and the service is complaining that you're doing it wrong ;-) spark-submit will let you pass in your Spark program, libraries, and some small files your program needs to run, but it will reject any larger files as not acceptable; the max size is somewhere around 200MB presently, but that could change ;-)
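
As a rough sketch of that pattern (the container, object, and service names here are made up, and this assumes the Swift connector has been configured as in the example further down), the program reads its data straight out of Object Storage instead of receiving it through --files:

# Hypothetical names: "mydata.txt" sits in container "mycontainer" of an
# Object Storage service configured under the name "myobjectstore";
# 'sc' is the already-created SparkContext.
data = sc.textFile("swift://mycontainer.myobjectstore/mydata.txt")

# From there the RDD is used like any other, e.g. a word count:
counts = (data.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
print(counts.take(10))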

You can certainly code up, in your Spark program, the configuration required to access your Object Storage account; the credentials and endpoint configuration properties are set via the Hadoop Swift connector config, as in the following Python example:

def set_hadoop_config(creds):
    # 'creds' holds the Object Storage credentials (name, auth_url, project_id,
    # user_id, password, region); 'sc' is the already-created SparkContext.
    prefix = "fs.swift.service." + creds['name']
    # Set the Hadoop Swift connector properties on the underlying Hadoop config.
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + ".auth.url", creds['auth_url'] + '/v2.0/tokens')
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", creds['project_id'])
    hconf.set(prefix + ".username", creds['user_id'])
    hconf.set(prefix + ".password", creds['password'])
    hconf.setInt(prefix + ".http.port", 8080)
    hconf.set(prefix + ".region", creds['region'])
    hconf.setBoolean(prefix + ".public", True)
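
To tie it together, here is one possible (not the only, nor an official) way to use that helper without hard-coding the credentials in the source: ship a small JSON file with the credentials alongside your program via --files (small files are fine there) and load it at runtime. The file name, dictionary keys, and Object Storage names below are assumptions for illustration, not anything the service mandates:

import json
from pyspark import SparkContext

sc = SparkContext(appName="ObjectStorageExample")

# "creds.json" is a hypothetical file shipped via --files; it contains the
# same keys that set_hadoop_config() reads, e.g.
# {"name": "myobjectstore", "auth_url": "...", "project_id": "...",
#  "user_id": "...", "password": "...", "region": "..."}
with open("creds.json") as f:
    creds = json.load(f)

set_hadoop_config(creds)

# Read the large data set directly from Object Storage; the service name in
# the swift:// URL must match creds['name'].
data = sc.textFile("swift://mycontainer." + creds['name'] + "/mydata.txt")
print(data.count())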

Presently, only the Analytic Notebooks set up this configuration automatically for you, when you bind an Object Storage service instance to the notebook service. I expect in the future this may also be possible with spark-submit ;-)

Upvotes: 1
