Reputation: 1262
I want to create a Parquet file from data whose format is unknown at compilation time. I receive the schema as text later on, and I know that some columns contain dates with time. I want to do this using Spark and Java, so I followed http://spark.apache.org/docs/1.2.1/sql-programming-guide.html#programmatically-specifying-the-schema and created a schema with the proper types. For the date-like columns I tried Spark's DataType.TimestampType and DataType.DateType, but neither of them works: when I try to save the file with JavaSchemaRDD.saveAsParquetFile, I get the error Unsupported datatype followed by the type I tried. I tried this with an emptyRDD, so data conversion is not the problem.
After looking into http://parquet.incubator.apache.org/documentation/latest/ and https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md, I think I need to convert the data into some integer/long type and annotate it to indicate that it represents a date. If so, how can I do this in Spark? Or do I need to do something else?
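For reference, a minimal sketch of the integer/long workaround I have in mind (the class and method names are mine, not Spark API): declare the column as DataType.LongType, store epoch milliseconds in it, and convert at the boundaries with plain JDK calls.

```java
import java.sql.Timestamp;

// Workaround sketch: until TimestampType is supported by the Parquet
// writer, keep the column as LongType holding epoch milliseconds and
// convert to/from java.sql.Timestamp at the edges of the pipeline.
// Helper names below are illustrative, not part of any Spark API.
public class TimestampAsLong {
    // Before building the Row: Timestamp -> epoch millis for a LongType column.
    public static long toEpochMillis(Timestamp ts) {
        return ts.getTime();
    }

    // After reading back: epoch millis -> Timestamp.
    public static Timestamp fromEpochMillis(long millis) {
        return new Timestamp(millis);
    }

    public static void main(String[] args) {
        Timestamp original = Timestamp.valueOf("2015-01-30 12:34:56.789");
        long stored = toEpochMillis(original);      // value written to Parquet
        Timestamp restored = fromEpochMillis(stored);
        System.out.println(original.equals(restored)); // prints "true"
    }
}
```

The round trip is lossless down to milliseconds, which is what Timestamp.getTime gives you; sub-millisecond nanos would need a separate column or a nanosecond-based long.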
Upvotes: 4
Views: 9446
Reputation: 90
I'm facing exactly the same issue. DateType and TimestampType support for Parquet files is due to be added in Spark 1.3 (more info in https://github.com/apache/spark/pull/3820 and https://issues.apache.org/jira/browse/SPARK-4709).
Spark will use Parquet's INT96 type to store the Timestamp type (just like Impala does).
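To make the INT96 representation concrete: in the Impala convention, the 12 bytes hold nanoseconds within the day (8 bytes) followed by a Julian day number (4 bytes). A small JDK-only sketch of that split, assuming a non-negative epoch-millis input; class and method names are illustrative, not Spark API.

```java
// Sketch of how an epoch-millis timestamp maps onto Parquet's INT96
// layout (the Impala convention): nanoseconds within the day plus a
// Julian day number. Names here are made up for illustration.
public class Int96Timestamp {
    static final long MILLIS_PER_DAY = 86_400_000L;
    static final int JULIAN_DAY_OF_EPOCH = 2_440_588; // Julian day of 1970-01-01

    // Assumes epochMillis >= 0; negative values would need floor division.
    public static int toJulianDay(long epochMillis) {
        return (int) (epochMillis / MILLIS_PER_DAY) + JULIAN_DAY_OF_EPOCH;
    }

    public static long toNanosOfDay(long epochMillis) {
        return (epochMillis % MILLIS_PER_DAY) * 1_000_000L;
    }

    public static void main(String[] args) {
        // Midnight at the Unix epoch lands exactly on Julian day 2440588.
        System.out.println(toJulianDay(0));  // prints "2440588"
        System.out.println(toNanosOfDay(0)); // prints "0"
    }
}
```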
Upvotes: 4