cucucool

Reputation: 3887

Parquet data and partition issue in Spark Structured streaming

I am using Spark Structured streaming; My DataFrame has the following schema

root 
 |-- data: struct (nullable = true) 
 |    |-- zoneId: string (nullable = true) 
 |    |-- deviceId: string (nullable = true) 
 |    |-- timeSinceLast: long (nullable = true) 
 |-- date: date (nullable = true) 

How can I do a writeStream with Parquet format, write the data (zoneId, deviceId, timeSinceLast; everything except date), and partition it by date? I tried the following code, but the partitionBy clause did not work:

val query1 = df1 
  .writeStream 
  .format("parquet") 
  .option("path", "/Users/abc/hb_parquet/data") 
  .option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint") 
  .partitionBy("data.zoneId") 
  .start() 

Upvotes: 2

Views: 7887

Answers (2)

Yuriy Bondaruk

Reputation: 4750

If you want to partition by date, then you have to pass that column to the partitionBy() method:

val query1 = df1 
  .writeStream 
  .format("parquet") 
  .option("path", "/Users/abc/hb_parquet/data") 
  .option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint") 
  .partitionBy("date") 
  .start()

If you want the data partitioned as <year>/<month>/<day>, make sure the date column is of DateType and then create appropriately formatted columns:

import org.apache.spark.sql.functions
import org.apache.spark.sql.types.DataTypes

val df = dataset.withColumn("date", dataset.col("date").cast(DataTypes.DateType))

df.withColumn("year", functions.date_format(df.col("date"), "yyyy")) // "yyyy" (calendar year), not "YYYY" (week-based year)
  .withColumn("month", functions.date_format(df.col("date"), "MM"))
  .withColumn("day", functions.date_format(df.col("date"), "dd"))
  .writeStream
  .format("parquet")
  .option("path", "/Users/abc/hb_parquet/data")
  .option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
  .partitionBy("year", "month", "day")
  .start()

Upvotes: 5

belka

Reputation: 1530

I think you should try the repartition method, which can take two kinds of arguments:

  • a column (or columns) to partition by
  • the desired number of partitions.

I suggest repartitioning by the date column to group your data by date before writing.
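A minimal sketch of that suggestion, reusing the DataFrame and paths from the question above. Note the distinction: repartition controls the in-memory shuffle partitioning (rows with the same date end up in the same task), while partitionBy on the writer controls the on-disk directory layout; repartition takes Column arguments, not bare strings.

```scala
import org.apache.spark.sql.functions.col

// Sketch: shuffle the stream by "date" before writing, so each date's
// rows are grouped into the same partition. df1 is the question's DataFrame.
val query = df1
  .repartition(col("date"))
  .writeStream
  .format("parquet")
  .option("path", "/Users/abc/hb_parquet/data")
  .option("checkpointLocation", "/Users/abc/hb_parquet/checkpoint")
  .start()
```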

A great link on the subject: https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

Upvotes: 1
