Reputation: 16172
I have a dataframe with a date column. I have parsed it into year, month, day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files.
Here is my approach to partitioning and writing the data:
from pyspark.sql import functions as f

df = (df.withColumn('year', f.year(f.col('date_col')))
        .withColumn('month', f.month(f.col('date_col')))
        .withColumn('day', f.dayofmonth(f.col('date_col'))))
df.write.partitionBy('year', 'month', 'day').parquet('/mnt/test/test.parquet')
This properly creates the parquet files, including the nested folder structure. However, I do not want the year, month, or day columns in the parquet files themselves.
Upvotes: 1
Views: 4921
Reputation: 299
If you use df.write.partitionBy('year', 'month', 'day'), these columns are not actually physically stored in the file data. They are simply rendered via the folder structure that partitionBy creates.
For example, partitionBy('year').csv("/data") will create something like:
/data/year=2018/part1---.csv
/data/year=2019/part1---.csv
When you read the data back, Spark uses the special path component year=xxx to populate these columns.
You can prove it by reading in the data of a single partition directly; in that case, year will not be a column:
df = spark.read.csv("/data/year=2019/")
df.printSchema()
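Conversely, if you read from the root path, Spark discovers the partition column from the year=... folder names and adds it back. A minimal sketch, assuming the /data layout above and an existing SparkSession named spark:
# Reading from the root path: Spark parses the year=... directories
# and adds 'year' back as a column.
df_all = spark.read.csv("/data")
df_all.printSchema()  # now includes 'year', inferred from the folder names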
@Shu's answer can also be used to investigate.
You can rest assured that these columns are not taking up storage space.
If you simply don't want to see the columns, you could put a view on top of this table that excludes them.
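For example, a minimal sketch assuming the paths from the question (the view name test_no_partition_cols is made up for illustration):
# Read the partitioned data back, drop the partition-derived columns,
# and expose the result as a temporary view.
df = spark.read.parquet('/mnt/test/test.parquet')
df.drop('year', 'month', 'day').createOrReplaceTempView('test_no_partition_cols')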
Upvotes: 1
Reputation: 31540
Spark/Hive won't write the year, month, day columns into your parquet files, because they are already in the partitionBy clause.
Example:
val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.coalesce(1).write.partitionBy("id").csv("/user/shu/temporary2") // write csv file
Checking the contents of the csv file:
hadoop fs -cat /user/shu/temporary2/id=1/part-00000-dc55f08e-9143-4b60-a94e-e28b1d7d9285.c000.csv
Output:
a
As you can see, there is no id value included in the csv file. In the same way, when you write a parquet file, the partition columns are not included in the part-*.parquet files.
To check the schema of a parquet file:
parquet-tools schema <hdfs://nn:8020/parquet_file>
This lets you verify exactly which columns are included in your parquet file.
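If parquet-tools isn't available, the same check can be done from PySpark by reading a single leaf partition directly; a sketch, where year=2020/month=1/day=1 is a made-up partition under the question's output path:
# The partition columns live only in the directory names,
# so they will not appear in this schema.
spark.read.parquet('/mnt/test/test.parquet/year=2020/month=1/day=1').printSchema()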
Upvotes: 3