Reputation: 88197
I am using PySpark on AWS Glue. It appears that when writing a dataset with a date column used as the partition key, the column is always converted to a string?
df = df \
    .withColumn("querydatetime", to_date(df["querydatetime"], DATE_FORMAT_STR))
...
df \
    .repartition("querydestinationplace", "querydatetime") \
    .write \
    .mode("overwrite") \
    .partitionBy("querydestinationplace", "querydatetime") \
    .parquet("s3://xxx/flights-test")
I noticed this in my table DDL from Athena:
CREATE EXTERNAL TABLE `flights_test`(
`key` string,
`agent` int,
`queryoutbounddate` date,
`queryinbounddate` date,
`price` decimal(10,2),
`outdeparture` timestamp,
`indeparture` timestamp,
`numberoutstops` int,
`out_is_holiday` boolean,
`out_is_longweekends` boolean,
`in_is_holiday` boolean,
`in_is_longweekends` boolean)
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://xxx/flights-test/'
TBLPROPERTIES (...)
Notice the partition clause:
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
Must the partition columns always be strings? In fact, querydestinationplace
should be an int. Will the string type be less efficient than an int or a date?
Upvotes: 1
Views: 1506
Reputation: 848
This is a known behavior of Parquet partitioning. You can add the following line before reading the Parquet files to disable this behavior:
# Prevent the integer id fields used for partitioning from being
# cast back to integers when the data is read.
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
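For reference, here is a minimal sketch of the same idea using the modern SparkSession API (assuming a session named spark and the same s3://xxx/flights-test path from the question), with a schema check to verify the effect:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Disable partition column type inference before reading; the partition
# columns will then come back as strings instead of inferred types.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

df = spark.read.parquet("s3://xxx/flights-test")
df.printSchema()
# Expected with inference disabled:
#  |-- querydestinationplace: string
#  |-- querydatetime: string
# With the default (inference enabled), Spark would typically infer
# querydestinationplace as an integer and querydatetime as a date.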
Upvotes: 2