Daniel Better

Reputation: 1

Spark 3.0 change in behaviour when using partitionBy on Timestamp DataType column?

I'm currently in the middle of migrating our Spark-based service from 2.4.5 to 3.0.0.

I've noticed a change in behavior when partitioning on a column of Timestamp type.

When writing the dataframe as parquet, everything works as expected (data is written with respect to the requested partitioning), but upon reading the saved dataframe back, I see the partition column truncated to a Date type, which of course affects the downstream logic.

Example code:

import java.sql.Timestamp

val sparkImplicits = spark.implicits
import sparkImplicits._

val simpleData = Seq(
  (1, "abd", Timestamp.valueOf("2020-01-01 00:00:01")),
  (2, "def", Timestamp.valueOf("2019-01-01 00:00:02"))
)

val df = simpleData.toDF("id", "str", "timestamp")

df.printSchema()
df.show()

// write the dataframe partitioned by the timestamp column
df.write.partitionBy("timestamp").parquet("partition_by_timestamp")

// read it back; the partition column's type is inferred from the directory names
val readDF = spark.read.parquet("partition_by_timestamp")
readDF.printSchema()
readDF.show(2)

The output for the provided snippet:

root
 |-- id: integer (nullable = false)
 |-- str: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

+---+---+-------------------+
| id|str|          timestamp|
+---+---+-------------------+
|  1|abd|2020-01-01 00:00:01|
|  2|def|2019-01-01 00:00:02|
+---+---+-------------------+

root
 |-- id: integer (nullable = true)
 |-- str: string (nullable = true)
 |-- timestamp: date (nullable = true)

+---+---+----------+
| id|str| timestamp|
+---+---+----------+
|  1|abd|2020-01-01|
|  2|def|2019-01-01|
+---+---+----------+

What is the origin of this change, and how should I go about keeping the value and type of the loaded dataframe's timestamp column as Timestamp?

Upvotes: 0

Views: 628

Answers (1)

Lars Skaug

Reputation: 1386

Setting spark.sql.legacy.timeParserPolicy to LEGACY should do the trick.
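
For example, setting it on the session before reading the data back (a minimal sketch; the same setting can also be passed as --conf spark.sql.legacy.timeParserPolicy=LEGACY at submit time):

// revert to the Spark 2.x time parser before reading the data back
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

val readDF = spark.read.parquet("partition_by_timestamp")
readDF.printSchema() // per the setting above, the partition column should be inferred as timestamp again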

As far as the reason for the change, it appears to have been done to align Spark with other systems such as pandas, R, and Apache Arrow, which all use the Proleptic Gregorian calendar.

For more details, see: https://databricks.com/blog/2020/07/22/a-comprehensive-look-at-dates-and-timestamps-in-apache-spark-3-0.html
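
As an alternative sketch (my own suggestion, not covered in the linked post): supplying an explicit schema on read should bypass partition-column type inference altogether:

import org.apache.spark.sql.types._

// declare the expected schema so the partition column's type is taken
// from the schema rather than inferred from the directory names
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("str", StringType),
  StructField("timestamp", TimestampType)
))

val readDF = spark.read.schema(schema).parquet("partition_by_timestamp")
readDF.printSchema() // timestamp stays a timestamp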

Upvotes: 2
