Jiew Meng

Reputation: 88197

PySpark/Glue: When using a date column as a partition key, is it always converted into a string?

I am using PySpark on AWS Glue. It appears that when writing a data set with a date column used as the partition key, it's always converted into a string?

df = df \
  .withColumn("querydatetime", to_date(df["querydatetime"], DATE_FORMAT_STR))
...
df \
  .repartition("querydestinationplace", "querydatetime") \
  .write \
  .mode("overwrite") \
  .partitionBy(["querydestinationplace", "querydatetime"]) \
  .parquet("s3://xxx/flights-test")

I notice my table DDL from Athena

CREATE EXTERNAL TABLE `flights_test`(
  `key` string, 
  `agent` int, 
  `queryoutbounddate` date, 
  `queryinbounddate` date, 
  `price` decimal(10,2), 
  `outdeparture` timestamp, 
  `indeparture` timestamp, 
  `numberoutstops` int, 
  `out_is_holiday` boolean, 
  `out_is_longweekends` boolean, 
  `in_is_holiday` boolean, 
  `in_is_longweekends` boolean)
PARTITIONED BY ( 
  `querydestinationplace` string, 
  `querydatetime` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://xxx/flights-test/'
TBLPROPERTIES (...)

Notice

PARTITIONED BY ( 
  `querydestinationplace` string, 
  `querydatetime` string)

Must the partition columns always be strings? In fact, querydestinationplace should be an int type. Will this string type be less efficient than an int or a date?

Upvotes: 1

Views: 1506

Answers (1)

Martin

Reputation: 848

This is a known behavior of Parquet partition handling in Spark. You can add the following line before reading the Parquet file to avoid this behavior:

# Disable partition column type inference, so the partition columns
# (e.g. integer id fields) are not auto-cast and keep their on-disk type.
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
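For context, the reason partition values can come back as strings at all is that Hive-style partitioning encodes them as `key=value` segments in the directory path, so on disk they are always strings; the setting above controls whether Spark tries to infer richer types from them when reading. The helper below is a hypothetical sketch of that path decoding for illustration, not Spark's actual implementation:

```python
from urllib.parse import unquote

def parse_partition_values(path: str) -> dict:
    """Extract key=value partition segments from a Hive-style path.

    Illustrative only: shows that partition values are plain strings
    on disk until something (like Spark's type inference) parses them.
    """
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = unquote(value)  # still a string at this point
    return values

path = "flights-test/querydestinationplace=1234/querydatetime=2019-01-01/part-00000.parquet"
print(parse_partition_values(path))
# {'querydestinationplace': '1234', 'querydatetime': '2019-01-01'}
```

With inference disabled as above, Spark keeps these raw strings as the partition column values instead of casting `'1234'` to an int.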

Upvotes: 2
