Umesh Kacha

Reputation: 13686

NPE while reading ORC file using Spark 1.4 API

I read many ORC files in Spark and process them; these files are basically Hive partitions. Most of the time the processing goes fine, but for a few files I get the following exception and I don't know why. The same files work fine when queried through Hive.

DataFrame df = hiveContext.read().format("orc").load("/path/in/hdfs");

java.lang.NullPointerException
    at org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:402)
    at org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:206)
    at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$8.apply(OrcRelation.scala:238)
    at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$8.apply(OrcRelation.scala:238)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:238)
    at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:290)
    at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:288)
    at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

Upvotes: 1

Views: 972

Answers (3)

Madhavi

Reputation: 11

You get this error when the directory structure Spark expects is not followed. Consider a table named partitionedtable that is partitioned on partitionCol1, partitionCol2, and partitionCol3:

hdfs dfs -ls /path/in/hdfs/partitionedtable/

/path/in/hdfs/partitionedtable/partitionCol1=1/partitionCol2=11/partitionCol3=111/part-00000

/path/in/hdfs/partitionedtable/partitionCol1=2/partitionCol2=22/partitionCol3=222/part-00001

/path/in/hdfs/partitionedtable/partitionCol1=3/ --> this one does not have any data.

Reference: https://issues.apache.org/jira/browse/SPARK-10304
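
Not part of the original answer, just a minimal diagnostic sketch assuming Hadoop 2.x client libraries: it walks the table's root directory and reports partition directories that contain no data files, which is the situation SPARK-10304 describes. The table path is the one from the listing above; the class and method names are made up for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.io.IOException;

    public class FindEmptyPartitions {

        // Recursively visit directories; print any directory that has neither
        // sub-directories nor non-empty files.
        static void report(FileSystem fs, Path dir) throws IOException {
            FileStatus[] children = fs.listStatus(dir);
            boolean hasSubDirs = false;
            boolean hasData = false;
            for (FileStatus child : children) {
                if (child.isDirectory()) {
                    hasSubDirs = true;
                    report(fs, child.getPath());
                } else if (child.getLen() > 0) {
                    hasData = true;
                }
            }
            if (!hasSubDirs && !hasData) {
                System.out.println("empty partition directory: " + dir);
            }
        }

        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            report(fs, new Path("/path/in/hdfs/partitionedtable"));
        }
    }

Any empty directories it reports can be removed or populated before pointing hiveContext.read().format("orc").load(...) at the table root.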

Upvotes: 1

Nylon Smile

Reputation: 9456

I am getting the same error on Spark 1.6.1. I have not got to the bottom of the problem yet, but my first discovery is that only some Hive partitions fail to return data (although they work and return perfectly fine using Hive itself). That means that if you remove the partition filter, or query another table, everything looks good.
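
A hedged isolation sketch, assuming an existing HiveContext named hiveContext and a list of hypothetical partition directory paths: reading each partition directory on its own makes the one that triggers the NullPointerException show up in isolation instead of failing the whole scan.

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.hive.HiveContext;

    public class PartitionProbe {
        // Load and count each partition directory separately; print which ones fail.
        static void probe(HiveContext hiveContext, String[] partitionDirs) {
            for (String dir : partitionDirs) {
                try {
                    DataFrame df = hiveContext.read().format("orc").load(dir);
                    System.out.println(dir + " -> " + df.count() + " rows");
                } catch (Exception e) {
                    System.out.println(dir + " -> failed: " + e);
                }
            }
        }
    }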

Upvotes: 0

F. P. Beekhof

Reputation: 1

A NullPointerException is always a bug, not in your code but in the Java program you are using. So file a bug report with Apache Spark.

Upvotes: 0
