Reputation: 31
I am trying to read an Avro file into a DataFrame, but I keep getting:
org.apache.spark.sql.avro.IncompatibleSchemaException: Unsupported type NULL
Since I am going to deploy it on Dataproc I am using Spark 2.4.0, but the same error occurred with other versions I tried.
Here are my dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
My main class:
public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf()
            .setAppName("Example");
    SparkSession spark = SparkSession
            .builder()
            .appName("Java Spark SQL basic example")
            .getOrCreate();
    Dataset<Row> rowDataset = spark.read().format("avro").load("avro_file");
}
Running command:
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 --master local[*] --class MainClass my-spark-app.jar
After running a lot of tests I concluded that it happens because my Avro schema contains a field defined with "type": "null". I am not the one creating the files, so I can't change the schema. I am able to read the files as an RDD using the newAPIHadoopFile method.
Is there a way to read Avro files with "type": "null" using a DataFrame, or will I have to work with RDDs?
Upvotes: 2
Views: 1424
Reputation: 71
You can specify a schema when you read the file. Create a schema that matches your file:

val accountSchema = StructType(List(
  StructField("XXX", DateType, true),
  StructField("YYY", StringType, true)))

val rowDataset = spark.read.format("avro").schema(accountSchema).load("avro_file")
I am not very familiar with Java syntax, but I think you can manage the translation.
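Since the question uses Java, here is a minimal sketch of the same idea in Java. The field names XXX and YYY and the path avro_file are placeholders carried over from the Scala snippet above; the assumption, as described in this answer, is that supplying the schema up front (declaring only the fields you actually need) keeps Spark from having to convert the Avro NULL-typed field:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ReadAvroWithSchema {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Read Avro with explicit schema")
                .getOrCreate();

        // Declare only the columns you need; the "type": "null" field is
        // simply omitted from the schema.
        StructType schema = DataTypes.createStructType(new StructField[]{
                DataTypes.createStructField("XXX", DataTypes.DateType, true),
                DataTypes.createStructField("YYY", DataTypes.StringType, true)
        });

        Dataset<Row> rowDataset = spark.read()
                .format("avro")
                .schema(schema)
                .load("avro_file");

        rowDataset.show();
    }
}
```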
Upvotes: 2