Reputation: 1
I'm currently in the middle of migrating our Spark-based service from 2.4.5 to 3.0.0. I've noticed a change in behavior when partitioning on a column of Timestamp type. Writing the DataFrame (as Parquet) works as expected and the data is written with the requested partitioning, but when reading the saved DataFrame back, the partition column is truncated to a Date type, which of course affects the downstream logic.
Example code:
import java.sql.Timestamp

val sparkImplicits = spark.implicits
import sparkImplicits._

val simpleData = Seq(
  (1, "abd", Timestamp.valueOf("2020-01-01 00:00:01")),
  (2, "def", Timestamp.valueOf("2019-01-01 00:00:02"))
)
val df = simpleData.toDF("id", "str", "timestamp")
df.printSchema()
df.show()

// Write partitioned by the timestamp column, then read the result back.
df.write.partitionBy("timestamp").parquet("partition_by_timestamp")
val readDF = spark.read.parquet("partition_by_timestamp")
readDF.printSchema()
readDF.show(2)
The output for the provided snippet:
root
|-- id: integer (nullable = false)
|-- str: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
+---+---+-------------------+
| id|str| timestamp|
+---+---+-------------------+
| 1|abd|2020-01-01 00:00:01|
| 2|def|2019-01-01 00:00:02|
+---+---+-------------------+
root
|-- id: integer (nullable = true)
|-- str: string (nullable = true)
|-- timestamp: date (nullable = true)
+---+---+----------+
| id|str| timestamp|
+---+---+----------+
| 1|abd|2020-01-01|
| 2|def|2019-01-01|
+---+---+----------+
What is the origin of this change, and how can I keep the value and type of the timestamp column in the loaded DataFrame as Timestamp?
Upvotes: 0
Views: 628
Reputation: 1386
Setting spark.sql.legacy.timeParserPolicy to LEGACY should do the trick.
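For example, a minimal sketch of applying the setting on an existing session before re-reading the data (the variable name legacyReadDF is just illustrative; the flag can equally be passed via --conf at submit time or in the SparkSession builder):

// Restore the legacy (pre-3.0, SimpleDateFormat-based) datetime parsing for this session.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

// Re-read the partitioned data; with the legacy parser the partition column
// should be inferred as a timestamp again, as described above.
val legacyReadDF = spark.read.parquet("partition_by_timestamp")
legacyReadDF.printSchema()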
As for the reason for the change, it appears to have been made to align Spark with other widely used tools such as pandas, R, and Apache Arrow, which all use the Proleptic Gregorian calendar.
For more details, see: https://databricks.com/blog/2020/07/22/a-comprehensive-look-at-dates-and-timestamps-in-apache-spark-3-0.html
Upvotes: 2