Reputation: 31
I used Python and boto3 to process some S3 files on the spark, and when I downloaded the files, it was unusual: The 's3'resource does not exist.
Because boto3 is not installed on each cluster node, I packaged the dependency packages used by boto3 as zip and used the spark cluster submitted by -- py-files, and this exception occurred.
Py4JJavaErrorTraceback (most recent call last)
<ipython-input-3-8147865bf49c> in <module>()
2
3
----> 4 extractor.extract(paths)
/usr/local/lib/python2.7/site-packages/extract-1.0-py2.7.egg/extract.pyc in extract(self, target_files_path)
52 try:
53 sc = self.get_spark_context()
---> 54 self._extract_file(sc, target_files_path)
55 finally:
56 if sc:
/usr/local/lib/python2.7/site-packages/extract-1.0-py2.7.egg/extract.pyc in _extract_file(self, sc, target_files_path)
109 def _extract_file(self, sc, target_files_path):
110 file_rdd = sc.parallelize(target_files_path, len(target_files_path))
--> 111 result_rdd = file_rdd.map(lambda file_path: self.process(file_path, self.func)).collect()
112 result_rdd.saveAsTextFile(self.result_path)
113
/usr/lib/spark/python/pyspark/rdd.py in collect(self)
769 """
770 with SCCallSiteSync(self.context) as css:
--> 771 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
772 return list(_load_from_socket(port, self._jrdd_deserializer))
773
/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:
/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost task 2.3 in stage 0.0 (TID 15, ip-172-20-219-210.corp.hpicloud.net): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/chqiang/appcache/application_1521024688288_67008/container_1521024688288_67008_01_000002/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/mnt/yarn/usercache/chqiang/appcache/application_1521024688288_67008/container_1521024688288_67008_01_000002/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/chqiang/appcache/application_1521024688288_67008/container_1521024688288_67008_01_000002/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/lib/python2.7/site-packages/extract-1.0-py2.7.egg/extract.py", line 111, in <lambda>
File "./lib.zip/extract.py", line 115, in process
local_path = download_file_from_s3(self.app_name, file_path)
File "./lib.zip/extract.py", line 22, in download_file_from_s3
s3 = boto3.resource('s3')
File "./lib.zip/boto3/__init__.py", line 100, in resource
return _get_default_session().resource(*args, **kwargs)
File "./lib.zip/boto3/session.py", line 347, in resource
has_low_level_client)
ResourceNotExistsError: The 's3' resource does not exist.
The available resources are:
-
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/chqiang/appcache/application_1521024688288_67008/container_1521024688288_67008_01_000002/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/mnt/yarn/usercache/chqiang/appcache/application_1521024688288_67008/container_1521024688288_67008_01_000002/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/chqiang/appcache/application_1521024688288_67008/container_1521024688288_67008_01_000002/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/lib/python2.7/site-packages/extract-1.0-py2.7.egg/extract.py", line 111, in <lambda>
File "./lib.zip/extract.py", line 115, in process
local_path = download_file_from_s3(self.app_name, file_path)
File "./lib.zip/extract.py", line 22, in download_file_from_s3
s3 = boto3.resource('s3')
File "./lib.zip/boto3/__init__.py", line 100, in resource
return _get_default_session().resource(*args, **kwargs)
File "./lib.zip/boto3/session.py", line 347, in resource
has_low_level_client)
ResourceNotExistsError: The 's3' resource does not exist.
The available resources are:
-
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
Could you please help me? thank you!
Upvotes: 3
Views: 4337
Reputation: 47
This happens because botocore is not able to function properly from within a zip file. Instead of passing your dependencies as --py-files
, pass it as mentioned here.
spark-submit --master yarn
...(other options)
--conf spark.yarn.appMasterEnv.PYTHONPATH="deps" \
--conf spark.executorEnv.PYTHONPATH="deps" \
--conf spark.yarn.dist.archives="{dependencies zip file path}#deps" \
main.py
Upvotes: 1
Reputation: 31
I ran into the same error today. I tried fixing it using client instead of resource, however that gave out another error.
botocore.exceptions.DataNotFoundError: Unable to load data for: endpoints
I googled a bit and came to the conclusion that it is not possible to zip up the boto3 package since packaging it results in loss of a few .json
files. The endpoints.json
is one of them. It exists in your botocore/data
directory however when you zip it, it doesn't have it.
The solution is to install boto3 during bootstrap. You can create a bootstrap file and provide it during the cluster build-up process (there is an option in the AWS EMR console for it). This will install boto3 on the master node and all of the slave nodes.
You can refer to the following answer as well : boto3 cannot create client on pyspark worker?
Upvotes: 3
Reputation: 31
I packaged the dependency packages as whl files instead of zip packages, then added them all to the --py-files
parameter(eg. a.whl,b.whl,c.whl
), changed s3=boto3.resource('s3')
in the code to client=boto3.client('s3')
and tried to succeed
Upvotes: 0