Christopher Settles

Reputation: 403

Spark complains of java.io.IOException: Null or empty fields is found

My executors throw an exception with the following stack trace:

java.io.IOException: Null or empty fields is found
    at org.apache.parquet.crypto.CryptoMetadataRetriever.getFileEncryptionProperties(CryptoMetadataRetriever.java:114)
    at org.apache.parquet.crypto.CryptoClassLoader.getFileEncryptionPropertiesOrNull(CryptoClassLoader.java:74)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:405)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:362)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:163)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:253)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:440)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1371)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:446)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

This happens when I attempt to save to a Parquet file in the following way:

DataFrameWriter<Row> dfw =
        sparkSession.createDataFrame(javaSparkContext.parallelize(uuids), MyCustomDataClass.class).write();

Here, uuids is an ArrayList<MyCustomDataClass>.

How can I save to a Hive table in Parquet format without this error?

Upvotes: 0

Views: 226

Answers (1)

Christopher Settles

Reputation: 403

I fixed it by doing

DataFrameWriter<Row> dfw =
        sparkSession.createDataFrame(uuids, CustomDataClass.class).write();

(I removed the parallelize call.) More generally, make sure the bean class you pass matches the element type of the List you pass to createDataFrame.

Here, CustomDataClass also has getters for each of its member variables. Example:

import java.io.Serializable;

public class CustomDataClass implements Serializable {
  private String uuid;
  private Integer age;

  public CustomDataClass(String uuid, Integer age) {
    this.uuid = uuid;
    this.age = age;
  }

  // JavaBean-style getters let Spark derive the schema by reflection.
  public String getUuid() {
    return uuid;
  }

  public Integer getAge() {
    return age;
  }
}
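
Since the question asks about writing to a Hive table, here is a minimal sketch of the full flow. It assumes a Hive-enabled SparkSession; the table name my_db.my_table is a placeholder.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SaveToHiveExample {
  public static void main(String[] args) {
    // enableHiveSupport is required for saveAsTable to target the Hive metastore.
    SparkSession sparkSession = SparkSession.builder()
        .appName("SaveToHiveExample")
        .enableHiveSupport()
        .getOrCreate();

    // The element type of the List matches the bean class passed below.
    List<CustomDataClass> uuids = Arrays.asList(
        new CustomDataClass("uuid-1", 42),
        new CustomDataClass("uuid-2", 7));

    Dataset<Row> df = sparkSession.createDataFrame(uuids, CustomDataClass.class);

    // my_db.my_table is a placeholder table name; Parquet is the storage format.
    df.write()
        .format("parquet")
        .mode("overwrite")
        .saveAsTable("my_db.my_table");
  }
}

With Hive support enabled, saveAsTable registers the table in the metastore and writes the data as Parquet files.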

Upvotes: 1
