Fandy_Chen
Fandy_Chen

Reputation: 31

The 's3' resource does not exist

I used Python and boto3 to process some S3 files on the spark, and when I downloaded the files, it was unusual: The 's3'resource does not exist.

Because boto3 is not installed on each cluster node, I packaged the dependency packages used by boto3 as zip and used the spark cluster submitted by -- py-files, and this exception occurred.

    Py4JJavaErrorTraceback (most recent call last)
<ipython-input-3-8147865bf49c> in <module>()
      2 
      3 
----> 4 extractor.extract(paths)

/usr/local/lib/python2.7/site-packages/extract-1.0-py2.7.egg/extract.pyc in extract(self, target_files_path)
     52         try:
     53             sc = self.get_spark_context()
---> 54             self._extract_file(sc, target_files_path)
     55         finally:
     56             if sc:

/usr/local/lib/python2.7/site-packages/extract-1.0-py2.7.egg/extract.pyc in _extract_file(self, sc, target_files_path)
    109     def _extract_file(self, sc, target_files_path):
    110         file_rdd = sc.parallelize(target_files_path, len(target_files_path))
--> 111         result_rdd = file_rdd.map(lambda file_path: self.process(file_path, self.func)).collect()
    112         result_rdd.saveAsTextFile(self.result_path)
    113 

/usr/lib/spark/python/pyspark/rdd.py in collect(self)
    769         """
    770         with SCCallSiteSync(self.context) as css:
--> 771             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    772         return list(_load_from_socket(port, self._jrdd_deserializer))
    773 

/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814 
    815         for temp_arg in temp_args:

/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost task 2.3 in stage 0.0 (TID 15, ip-172-20-219-210.corp.hpicloud.net): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/chqiang/appcache/application_1521024688288_67008/container_1521024688288_67008_01_000002/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/mnt/yarn/usercache/chqiang/appcache/application_1521024688288_67008/container_1521024688288_67008_01_000002/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/chqiang/appcache/application_1521024688288_67008/container_1521024688288_67008_01_000002/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/lib/python2.7/site-packages/extract-1.0-py2.7.egg/extract.py", line 111, in <lambda>
  File "./lib.zip/extract.py", line 115, in process
    local_path = download_file_from_s3(self.app_name, file_path)
  File "./lib.zip/extract.py", line 22, in download_file_from_s3
    s3 = boto3.resource('s3')
  File "./lib.zip/boto3/__init__.py", line 100, in resource
    return _get_default_session().resource(*args, **kwargs)
  File "./lib.zip/boto3/session.py", line 347, in resource
    has_low_level_client)
ResourceNotExistsError: The 's3' resource does not exist.
The available resources are:
   - 


    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/chqiang/appcache/application_1521024688288_67008/container_1521024688288_67008_01_000002/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/mnt/yarn/usercache/chqiang/appcache/application_1521024688288_67008/container_1521024688288_67008_01_000002/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/chqiang/appcache/application_1521024688288_67008/container_1521024688288_67008_01_000002/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/lib/python2.7/site-packages/extract-1.0-py2.7.egg/extract.py", line 111, in <lambda>
  File "./lib.zip/extract.py", line 115, in process
    local_path = download_file_from_s3(self.app_name, file_path)
  File "./lib.zip/extract.py", line 22, in download_file_from_s3
    s3 = boto3.resource('s3')
  File "./lib.zip/boto3/__init__.py", line 100, in resource
    return _get_default_session().resource(*args, **kwargs)
  File "./lib.zip/boto3/session.py", line 347, in resource
    has_low_level_client)
ResourceNotExistsError: The 's3' resource does not exist.
The available resources are:
   - 


    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    ... 1 more

Could you please help me? thank you!

Upvotes: 3

Views: 4337

Answers (3)

Chinmay Kulkarni
Chinmay Kulkarni

Reputation: 47

This happens because botocore is not able to function properly from within a zip file. Instead of passing your dependencies as --py-files, pass it as mentioned here.

spark-submit --master yarn
    ...(other options)
    --conf spark.yarn.appMasterEnv.PYTHONPATH="deps" \
    --conf spark.executorEnv.PYTHONPATH="deps" \
    --conf spark.yarn.dist.archives="{dependencies zip file path}#deps" \
    main.py

Upvotes: 1

Parul Priya
Parul Priya

Reputation: 31

I ran into the same error today. I tried fixing it using client instead of resource, however that gave out another error.

botocore.exceptions.DataNotFoundError: Unable to load data for: endpoints

I googled a bit and came to the conclusion that it is not possible to zip up the boto3 package since packaging it results in loss of a few .json files. The endpoints.json is one of them. It exists in your botocore/data directory however when you zip it, it doesn't have it.

The solution is to install boto3 during bootstrap. You can create a bootstrap file and provide it during the cluster build-up process (there is an option in the AWS EMR console for it). This will install boto3 on the master node and all of the slave nodes.

You can refer to the following answer as well : boto3 cannot create client on pyspark worker?

Upvotes: 3

Fandy_Chen
Fandy_Chen

Reputation: 31

I packaged the dependency packages as whl files instead of zip packages, then added them all to the --py-files parameter(eg. a.whl,b.whl,c.whl), changed s3=boto3.resource('s3') in the code to client=boto3.client('s3') and tried to succeed

Upvotes: 0

Related Questions