Evelina Dumitrescu

Reputation: 23

Incompatible configuration between Spark and HBaseTestingUtility

We are using the MiniDFSCluster and MiniHBaseCluster from HBaseTestingUtility to run unit tests for our Spark jobs. The Spark configuration that we use is:

conf.set("spark.sql.catalogImplementation", "hive")
      .set("spark.sql.warehouse.dir", getWarehousePath)
      .set("javax.jdo.option.ConnectionURL", s"jdbc:derby:;databaseName=$getMetastorePath;create=true")
      .set("shark.test.data.path", dataFilePath)
      .set("hive.exec.dynamic.partition.mode", "nonstrict")
      .set("spark.kryo.registrator", "CustomKryoRegistrar")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[org.apache.hadoop.hbase.client.Result]))
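
The mini clusters come up with stock defaults, and the conf above feeds the test session, roughly like this (a minimal sketch rather than our exact code):

import org.apache.hadoop.hbase.HBaseTestingUtility
import org.apache.spark.sql.SparkSession

// Sketch only: default HBaseTestingUtility configuration, nothing overridden
val utility = new HBaseTestingUtility()
utility.startMiniCluster() // starts both the MiniDFSCluster and the MiniHBaseCluster

// Build the test session from the conf shown above
// (spark.sql.catalogImplementation=hive is already set there)
val spark = SparkSession.builder()
  .master("local[*]")
  .config(conf)
  .getOrCreate()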

For the MiniDFSCluster and MiniHBaseCluster we use the default HBaseTestingUtility configuration. The release versions that we use are:

In our unit tests, when we try to run a Spark job that reads Hive data, we get the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 14, localhost, executor driver): java.lang.UnsupportedOperationException: Byte-buffer read unsupported by org.apache.hadoop.fs.BufferedFSInputStream
        at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:158)
        at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:154)
        at org.apache.parquet.hadoop.util.H2SeekableInputStream$H2Reader.read(H2SeekableInputStream.java:81)
        at org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:90)
        at org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:75)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:546)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:516)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:510)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:459)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:371)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$1(ParquetFileFormat.scala:370)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:374)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:645)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:270)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:262)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:456)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1334)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:462)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1935)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1923)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1922)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1922)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:953)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:953)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:953)
  ...
  Cause: java.lang.UnsupportedOperationException: Byte-buffer read unsupported by org.apache.hadoop.fs.BufferedFSInputStream
  at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:158)
  at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:154)
  at org.apache.parquet.hadoop.util.H2SeekableInputStream$H2Reader.read(H2SeekableInputStream.java:81)
  at org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:90)
  at org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:75)
  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:546)
  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:516)
  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:510)
  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:459)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:371)
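
The failing job does nothing more exotic than a plain read of a Hive table, essentially (the table name is a placeholder):

// Placeholder table name; the task fails while Parquet reads the file footer
spark.sql("SELECT * FROM some_hive_table").count()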

Is there an incompatible configuration between Spark, the MiniDFSCluster, and the MiniHBaseCluster?
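
For reference, which FileSystem implementation actually backs the warehouse path can be checked with a sketch like the following (our own diagnostic, not taken from the stack trace). BufferedFSInputStream belongs to the local filesystem, so seeing LocalFileSystem here instead of DistributedFileSystem would mean Spark is not reading through the MiniDFSCluster:

import org.apache.hadoop.fs.Path

// Diagnostic sketch: resolve the FileSystem behind spark.sql.warehouse.dir
val warehouse = new Path(spark.conf.get("spark.sql.warehouse.dir"))
val fs = warehouse.getFileSystem(spark.sparkContext.hadoopConfiguration)
println(s"warehouse filesystem: ${fs.getClass.getName}")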

Upvotes: 0

Views: 41

Answers (0)
