Reputation: 96
I have been using spark-submit to run some simple jobs (Word Count, SparkPi) on the Bluemix Spark service. Both run fine. I used a small text file to test Word Count through spark-submit.sh (uploaded the file using --files). However, when I used a large file instead, the job did not run. I looked at the logs and saw the message "413 Request Entity Too Large".
I assume this means that the file is too large to submit. So I have 3 questions.
1. Can I increase the limit on the size of the file I am allowed to upload through spark-submit?
2. Can I link my application to the existing Swift object storage and just upload my large file there?
On question 2, I have done some initial research, and it seems I need to add credentials to the request to access object storage. So one additional question.
3. How do I add those credentials so my application can read from object storage?
I appreciate your time. I would not ask if I did not need to.
Upvotes: 0
Views: 429
Reputation: 430
re: "413 Request Entity Too Large"
The Bluemix Apache Spark service is a compute service only, meaning that your data should reside in a storage service, like the Bluemix Object Storage service, Cloudant, S3, whatever makes sense. Your spark-submit program then connects to that storage service, creates an RDD over that data, and goes to town. In your case, you are trying to pass in the data on which you want to run analytics via the --files parameter of spark-submit, and the service is complaining that you're doing it wrong ;-) spark-submit will let you pass in your Spark program, libraries, and some small files that your program needs to run, but it will reject larger files as not acceptable; the maximum size is currently somewhere around 200 MB, but that could change ;-)
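For illustration, a minimal sketch of that pattern (the container name "mycontainer", the service name "myservice", and the object name are placeholders, and it assumes the Swift connector configuration shown further down has already been applied to the SparkContext):
# Create an RDD directly over the large file sitting in object storage,
# instead of shipping the file to the service with --files.
lines = sc.textFile("swift://mycontainer.myservice/big_input.txt")
# Ordinary word count over that RDD.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.take(10))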
You can certainly code up, in your Spark program, the configuration required to access your object storage account; the credentials and endpoint configuration properties are set via the Hadoop connector configuration, as in the following Python example:
def set_hadoop_config(creds):
    # Configure the Hadoop Swift connector on the SparkContext so that
    # swift:// URLs can be read; "creds" holds the Object Storage credentials.
    prefix = "fs.swift.service." + creds['name']
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + ".auth.url", creds['auth_url'] + '/v2.0/tokens')
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", creds['project_id'])
    hconf.set(prefix + ".username", creds['user_id'])
    hconf.set(prefix + ".password", creds['password'])
    hconf.setInt(prefix + ".http.port", 8080)
    hconf.set(prefix + ".region", creds['region'])
    hconf.setBoolean(prefix + ".public", True)
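As a rough usage sketch, you would call it with the credentials of your Object Storage service and then read swift:// URLs directly; every value below is a placeholder, not a real endpoint or credential:
# Placeholder credentials; substitute the values from your own
# Object Storage service credentials.
creds = {
    'name': 'myservice',                       # becomes the service part of swift:// URLs
    'auth_url': 'https://YOUR_AUTH_ENDPOINT',  # Keystone identity endpoint
    'project_id': 'YOUR_PROJECT_ID',
    'user_id': 'YOUR_USER_ID',
    'password': 'YOUR_PASSWORD',
    'region': 'YOUR_REGION'
}
set_hadoop_config(creds)
# Container and object name are placeholders for whatever you uploaded.
rdd = sc.textFile("swift://mycontainer." + creds['name'] + "/big_input.txt")
print(rdd.count())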
Presently, only the Analytic Notebooks set up this configuration automatically for you, when you bind an Object Storage service to the notebook service instance. I expect that in the future this may also be possible with spark-submit ;-)
Upvotes: 1