Evelina Dumitrescu

Reputation: 23

Incompatible configuration between Spark and HBaseTestingUtility

We are using the MiniDFSCluster and MiniHBaseCluster from HBaseTestingUtility to run unit tests for our Spark jobs. The Spark configuration that we use is:

conf.set("spark.sql.catalogImplementation", "hive")
      .set("spark.sql.warehouse.dir", getWarehousePath)
      .set("javax.jdo.option.ConnectionURL", s"jdbc:derby:;databaseName=$getMetastorePath;create=true")
      .set("shark.test.data.path", dataFilePath)
      .set("hive.exec.dynamic.partition.mode", "nonstrict")
      .set("spark.kryo.registrator", "CustomKryoRegistrar")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[org.apache.hadoop.hbase.client.Result]))
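
The mini clusters come up with stock defaults, and the conf above feeds the test session, roughly like this (a minimal sketch rather than our exact code):

import org.apache.hadoop.hbase.HBaseTestingUtility
import org.apache.spark.sql.SparkSession

// Sketch only: default HBaseTestingUtility configuration, nothing overridden
val utility = new HBaseTestingUtility()
utility.startMiniCluster() // starts both the MiniDFSCluster and the MiniHBaseCluster

// Build the test session from the conf shown above
// (spark.sql.catalogImplementation=hive is already set there)
val spark = SparkSession.builder()
  .master("local[*]")
  .config(conf)
  .getOrCreate()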

For the MiniDFSCluster and MiniHBaseCluster we use the default HBaseTestingUtility configuration. The release versions that we use are:

In our unit tests, when we try to run a Spark job that reads Hive data, we get the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 14, localhost, executor driver): java.lang.UnsupportedOperationException: Byte-buffer read unsupported by org.apache.hadoop.fs.BufferedFSInputStream
        at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:158)
        at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:154)
        at org.apache.parquet.hadoop.util.H2SeekableInputStream$H2Reader.read(H2SeekableInputStream.java:81)
        at org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:90)
        at org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:75)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:546)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:516)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:510)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:459)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:371)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$1(ParquetFileFormat.scala:370)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:374)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:645)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:270)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:262)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:456)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1334)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:462)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1935)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1923)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1922)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1922)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:953)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:953)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:953)
  ...
  Cause: java.lang.UnsupportedOperationException: Byte-buffer read unsupported by org.apache.hadoop.fs.BufferedFSInputStream
  at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:158)
  at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:154)
  at org.apache.parquet.hadoop.util.H2SeekableInputStream$H2Reader.read(H2SeekableInputStream.java:81)
  at org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:90)
  at org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:75)
  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:546)
  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:516)
  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:510)
  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:459)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:371)
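
The failing job does nothing more exotic than a plain read of a Hive table, essentially (the table name is a placeholder):

// Placeholder table name; the task fails while Parquet reads the file footer
spark.sql("SELECT * FROM some_hive_table").count()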

Is there an incompatible configuration between Spark, the MiniDFSCluster, and the MiniHBaseCluster?
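
For reference, which FileSystem implementation actually backs the warehouse path can be checked with a sketch like the following (our own diagnostic, not taken from the stack trace). BufferedFSInputStream belongs to the local filesystem, so seeing LocalFileSystem here instead of DistributedFileSystem would mean Spark is not reading through the MiniDFSCluster:

import org.apache.hadoop.fs.Path

// Diagnostic sketch: resolve the FileSystem behind spark.sql.warehouse.dir
val warehouse = new Path(spark.conf.get("spark.sql.warehouse.dir"))
val fs = warehouse.getFileSystem(spark.sparkContext.hadoopConfiguration)
println(s"warehouse filesystem: ${fs.getClass.getName}")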

Upvotes: 0

Views: 41

Answers (0)
