Reputation: 403
My executors throw an exception with the following stack trace:
java.io.IOException: Null or empty fields is found
at org.apache.parquet.crypto.CryptoMetadataRetriever.getFileEncryptionProperties(CryptoMetadataRetriever.java:114)
at org.apache.parquet.crypto.CryptoClassLoader.getFileEncryptionPropertiesOrNull(CryptoClassLoader.java:74)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:405)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:362)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:163)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:253)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:440)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1371)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:446)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This happens when I attempt to save to a Parquet file in the following way:
DataFrameWriter<Row> dfw =
    sparkSession.createDataFrame(javaSparkContext.parallelize(uuids), MyCustomDataClass.class).write();
where uuids is of type ArrayList<MyCustomDataClass>.
How can I save to a Hive table in Parquet format without this error?
Upvotes: 0
Views: 226
Reputation: 403
I fixed it by doing
DataFrameWriter<Row> dfw =
    sparkSession.createDataFrame(uuids, CustomDataClass.class).write();
(removed the parallelize). In general, make sure the bean class you pass matches the element type of the List you pass to createDataFrame.
CustomDataClass also needs a getter for each of its member variables. Example:
import java.io.Serializable;

// Bean class used with createDataFrame; each field needs a public getter.
public class CustomDataClass implements Serializable {
    private String uuid;
    private Integer age;

    public CustomDataClass(String uuid, Integer age) {
        this.uuid = uuid;
        this.age = age;
    }

    public String getUuid() {
        return uuid;
    }

    public Integer getAge() {
        return age;
    }
}
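To address the original question of writing the result to a Hive table in Parquet format, a minimal sketch along these lines should work once the DataFrame is built from the matching bean class; the table name my_table is just a placeholder, and the SparkSession must have been created with Hive support enabled:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Build the DataFrame directly from the List and the matching bean class.
Dataset<Row> df = sparkSession.createDataFrame(uuids, CustomDataClass.class);

// Write it out as a Parquet-backed Hive table.
df.write()
    .mode(SaveMode.Overwrite)      // or SaveMode.Append, depending on your use case
    .format("parquet")
    .saveAsTable("my_table");      // placeholder table name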
Upvotes: 1