Reputation: 309
This is kind of an edge case that occurs when saving a Parquet table with partitioning in Spark SQL:
// schema definition
final StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("time", DataTypes.StringType, true),
    DataTypes.createStructField("accountId", DataTypes.StringType, true),
    ...
));
DataFrame df = hiveContext.read().schema(schema).json(stringJavaRDD);
df.coalesce(1)
  .write()
  .mode(SaveMode.Append)
  .format("parquet")
  .partitionBy("year")
  .saveAsTable("tblclick8partitioned");
Spark warns:
Persisting partitioned data source relation into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive
In Hive:
hive> describe tblclick8partitioned;
OK
col array<string> from deserializer
Time taken: 0.04 seconds, Fetched: 1 row(s)
Obviously the schema is not correct; however, if I use saveAsTable in Spark SQL without partitioning, the table can be queried without problems.
The question is: how can I make a partitioned Parquet table in Spark SQL that is compatible with Hive?
Upvotes: 14
Views: 23850
Reputation: 1027
A solution is to create the table with Hive first and then save the data with ...partitionBy("year").insertInto("default.mytable").
In my experience, creating the table in Hive and then using ...partitionBy("year").saveAsTable("default.mytable") did not work. This was with Spark 1.6.2.
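A minimal sketch of this approach in Scala, assuming an existing HiveContext hc and a DataFrame df with a data column val (string) and a partition column year (int); the table name default.mytable and the column names are illustrative:

import org.apache.spark.sql.SaveMode

// Create the table in Hive first, so the metastore owns the schema
// and the partition definition. Names here are illustrative.
hc.sql("create table default.mytable (val string) " +
  "partitioned by (year int) stored as parquet")

// Dynamic partition inserts into a Hive table typically need these settings.
hc.setConf("hive.exec.dynamic.partition", "true")
hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

// insertInto writes into the Hive-defined table (columns are resolved by
// position in 1.6), so the partitions end up registered in the metastore.
df.write
  .mode(SaveMode.Append)
  .partitionBy("year")
  .insertInto("default.mytable")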
Upvotes: 1
Reputation: 131
That's because DataFrame.saveAsTable creates RDD partitions but not Hive partitions. The workaround is to create the table via HQL before calling DataFrame.saveAsTable. An example from SPARK-14927 looks like this:
hc.sql("create external table tmp.partitiontest1(val string) partitioned by (year int)")
Seq(2012 -> "a", 2013 -> "b", 2014 -> "c").toDF("year", "val")
.write
.partitionBy("year")
.mode(SaveMode.Append)
.saveAsTable("tmp.partitiontest1")
Upvotes: 13