Reputation: 16172
I have a dataframe with a date column. I have parsed it into year, month, day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files.
Here is my approach to partitioning and writing the data:
from pyspark.sql import functions as f

df = (df.withColumn('year', f.year(f.col('date_col')))
        .withColumn('month', f.month(f.col('date_col')))
        .withColumn('day', f.dayofmonth(f.col('date_col'))))
df.write.partitionBy('year', 'month', 'day').parquet('/mnt/test/test.parquet')
This properly creates the parquet files, including the nested folder structure. However, I do not want the year, month, or day columns in the parquet files themselves.
Upvotes: 1
Views: 4921
Reputation: 299
If you use df.write.partitionBy('year', 'month', 'day'), these columns are not actually physically stored in the file data. They are simply rendered via the folder structure that partitionBy creates.
For example, partitionBy('year').csv("/data") will create something like:
/data/year=2018/part1---.csv
/data/year=2019/part1---.csv
When you read the data back, Spark uses the special path component year=xxx to populate these columns.
You can prove it by reading in the data of a single partition directly; in that case, year will not be a column:
df = spark.read.csv("/data/year=2019/")
df.printSchema()
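Conversely, if you read from the root path, Spark discovers the partition column from the year=... folder names and adds it back. A minimal sketch, assuming the /data layout above and an existing SparkSession named spark:
# Reading from the root path: Spark parses the year=... directories
# and adds 'year' back as a column.
df_all = spark.read.csv("/data")
df_all.printSchema()  # now includes 'year', inferred from the folder names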
@Shu's answer can also be used to investigate.
You can rest assured that these columns are not taking up storage space.
If you simply don't want to see the columns, you could put a view on top of this table that excludes them.
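For example, a minimal sketch assuming the paths from the question (the view name test_no_partition_cols is made up for illustration):
# Read the partitioned data back, drop the partition-derived columns,
# and expose the result as a temporary view.
df = spark.read.parquet('/mnt/test/test.parquet')
df.drop('year', 'month', 'day').createOrReplaceTempView('test_no_partition_cols')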
Upvotes: 1
Reputation: 31540
Spark/Hive won't write the year, month, day columns into your parquet files, because they are already in the partitionBy clause.
Example:
val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.coalesce(1).write.partitionBy("id").csv("/user/shu/temporary2") // write csv file
Checking the contents of the csv file:
hadoop fs -cat /user/shu/temporary2/id=1/part-00000-dc55f08e-9143-4b60-a94e-e28b1d7d9285.c000.csv
Output:
a
As you can see, there is no id value included in the csv file. In the same way, when you write a parquet file, the partition columns are not included in the part-*.parquet files.
To check the schema of a parquet file:
parquet-tools schema <hdfs://nn:8020/parquet_file>
This lets you verify exactly which columns are included in your parquet file.
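If parquet-tools isn't available, the same check can be done from PySpark by reading a single leaf partition directly; a sketch, where year=2020/month=1/day=1 is a made-up partition under the question's output path:
# The partition columns live only in the directory names,
# so they will not appear in this schema.
spark.read.parquet('/mnt/test/test.parquet/year=2020/month=1/day=1').printSchema()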
Upvotes: 3