Null pointer exception when trying to fetch data from S3 using pyspark

Question

I am getting a nullpointer exception when I am trying to get data from S3 using pyspark. I am running spark 1.6.1 with hadoop 2.4. I tried using both s3n and s3a. Tried setting the configurations in the following way as well:

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3.impl",     "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "aws-key")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "aws-secret-key")

Made sure that the bucket had permission for authenticated users.

>>> myRDD = sc.textFile("s3n://aws-key:aws-secret-key@my-bucket/data.csv-000").count()

16/11/10 18:37:50 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 157.2 KB, free 1755.2 KB)
16/11/10 18:37:50 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 17.0 KB, free 1772.2 KB)
16/11/10 18:37:50 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:61806 (size: 17.0 KB, free: 510.9 MB)
16/11/10 18:37:50 INFO SparkContext: Created broadcast 10 from textFile at NativeMethodAccessorImpl.java:-2
Traceback (most recent call last):
  File "", line 1, in 
  File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/pyspark/rdd.py", line 1004, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/pyspark/rdd.py", line 995, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/pyspark/rdd.py", line 869, in fold
    vals = self.mapPartitions(func).collect()
  File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/pyspark/rdd.py", line 771, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__

  File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/Users/skalyanpur/spark-1.6.1-bin-hadoop2.4/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433)
    at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

stevel · Accepted Answer

It's not that clear what caused the failure; the ine where the exception was raised doesn't show anything obvious.

My recommendation would be to switch to s3a, which is the S3 connector which we in the ASF projects are currently maintaining; s3n is being left alone as the 100% bug-for-bug backwards compatible connector.

s3a isn't going to work as it's not in Hadoop-2.4; it came in with Hadoop-2.6 and reached production-ready state by Hadoop 2.7.1. Grab a version of spark built against that and you should see your life better. And, if not: you can file bug reports against issues.apache.org that won't get closed as WONTFIX.

ps. you don't need to include your AWS user:secret in URLs if you've set the properties in your configuration; this will help keep your secrets out of the logs.

Null pointer exception when trying to fetch data from S3 using pyspark

Answers (1)

Related Questions