Reputation: 1262
I want to create a Parquet file from data whose format is unknown at compilation time. I receive the schema as text later on, and I know that some columns contain dates with time. I want to do this using Spark and Java, so I followed http://spark.apache.org/docs/1.2.1/sql-programming-guide.html#programmatically-specifying-the-schema and created a schema with the proper types. For the date-like columns I tried Spark's DataType.TimestampType and DataType.DateType, but neither of them works: when I try to save the file with JavaSchemaRDD.saveAsParquetFile, I get the error Unsupported datatype followed by the type I tried. I tried this with an emptyRDD, so data conversion is not the problem.
After looking into http://parquet.incubator.apache.org/documentation/latest/ and https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md, I think I need to convert the data into some integer/long type and annotate it to indicate that it represents a date. If so, how can I do this in Spark? Or do I need to do something else?
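For reference, a minimal sketch of the integer/long workaround I have in mind (the class and method names are mine, not Spark API): declare the column as DataType.LongType, store epoch milliseconds in it, and convert at the boundaries with plain JDK calls.

```java
import java.sql.Timestamp;

// Workaround sketch: until TimestampType is supported by the Parquet
// writer, keep the column as LongType holding epoch milliseconds and
// convert to/from java.sql.Timestamp at the edges of the pipeline.
// Helper names below are illustrative, not part of any Spark API.
public class TimestampAsLong {
    // Before building the Row: Timestamp -> epoch millis for a LongType column.
    public static long toEpochMillis(Timestamp ts) {
        return ts.getTime();
    }

    // After reading back: epoch millis -> Timestamp.
    public static Timestamp fromEpochMillis(long millis) {
        return new Timestamp(millis);
    }

    public static void main(String[] args) {
        Timestamp original = Timestamp.valueOf("2015-01-30 12:34:56.789");
        long stored = toEpochMillis(original);      // value written to Parquet
        Timestamp restored = fromEpochMillis(stored);
        System.out.println(original.equals(restored)); // prints "true"
    }
}
```

The round trip is lossless down to milliseconds, which is what Timestamp.getTime gives you; sub-millisecond nanos would need a separate column or a nanosecond-based long.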
Upvotes: 4
Views: 9446
Reputation: 90
I'm facing exactly the same issue. DateType and TimestampType support for Parquet files is due to be added in Spark 1.3 (more info in https://github.com/apache/spark/pull/3820 and https://issues.apache.org/jira/browse/SPARK-4709).
Spark will use Parquet's INT96 type to store the Timestamp type (just like Impala does).
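To make the INT96 representation concrete: in the Impala convention, the 12 bytes hold nanoseconds within the day (8 bytes) followed by a Julian day number (4 bytes). A small JDK-only sketch of that split, assuming a non-negative epoch-millis input; class and method names are illustrative, not Spark API.

```java
// Sketch of how an epoch-millis timestamp maps onto Parquet's INT96
// layout (the Impala convention): nanoseconds within the day plus a
// Julian day number. Names here are made up for illustration.
public class Int96Timestamp {
    static final long MILLIS_PER_DAY = 86_400_000L;
    static final int JULIAN_DAY_OF_EPOCH = 2_440_588; // Julian day of 1970-01-01

    // Assumes epochMillis >= 0; negative values would need floor division.
    public static int toJulianDay(long epochMillis) {
        return (int) (epochMillis / MILLIS_PER_DAY) + JULIAN_DAY_OF_EPOCH;
    }

    public static long toNanosOfDay(long epochMillis) {
        return (epochMillis % MILLIS_PER_DAY) * 1_000_000L;
    }

    public static void main(String[] args) {
        // Midnight at the Unix epoch lands exactly on Julian day 2440588.
        System.out.println(toJulianDay(0));  // prints "2440588"
        System.out.println(toNanosOfDay(0)); // prints "0"
    }
}
```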
Upvotes: 4