Reputation: 1
I'm currently in the middle of migrating our Spark-based service from 2.4.5 to 3.0.0. I've noticed a change in behavior when partitioning on a column of Timestamp type. Writing the DataFrame (as Parquet) works as expected and the data is written with the requested partitioning, but when reading the saved DataFrame back, the partition column is truncated to a Date type, which of course affects the downstream logic.
Example code:
import java.sql.Timestamp

val sparkImplicits = spark.implicits
import sparkImplicits._

val simpleData = Seq(
  (1, "abd", Timestamp.valueOf("2020-01-01 00:00:01")),
  (2, "def", Timestamp.valueOf("2019-01-01 00:00:02"))
)
val df = simpleData.toDF("id", "str", "timestamp")
df.printSchema()
df.show()

// Write partitioned by the timestamp column, then read the result back.
df.write.partitionBy("timestamp").parquet("partition_by_timestamp")
val readDF = spark.read.parquet("partition_by_timestamp")
readDF.printSchema()
readDF.show(2)
The output for the provided snippet:
root
|-- id: integer (nullable = false)
|-- str: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
+---+---+-------------------+
| id|str| timestamp|
+---+---+-------------------+
| 1|abd|2020-01-01 00:00:01|
| 2|def|2019-01-01 00:00:02|
+---+---+-------------------+
root
|-- id: integer (nullable = true)
|-- str: string (nullable = true)
|-- timestamp: date (nullable = true)
+---+---+----------+
| id|str| timestamp|
+---+---+----------+
| 1|abd|2020-01-01|
| 2|def|2019-01-01|
+---+---+----------+
What is the origin of this change, and how can I keep the value and type of the timestamp column in the loaded DataFrame as Timestamp?
Upvotes: 0
Views: 628
Reputation: 1386
Setting spark.sql.legacy.timeParserPolicy to LEGACY should do the trick.
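For example, a minimal sketch of applying the setting on an existing session before re-reading the data (the variable name legacyReadDF is just illustrative; the flag can equally be passed via --conf at submit time or in the SparkSession builder):

// Restore the legacy (pre-3.0, SimpleDateFormat-based) datetime parsing for this session.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

// Re-read the partitioned data; with the legacy parser the partition column
// should be inferred as a timestamp again, as described above.
val legacyReadDF = spark.read.parquet("partition_by_timestamp")
legacyReadDF.printSchema()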
As for the reason for the change, it appears to have been made to align Spark with other widely used tools such as pandas, R, and Apache Arrow, which all use the Proleptic Gregorian calendar.
For more details, see: https://databricks.com/blog/2020/07/22/a-comprehensive-look-at-dates-and-timestamps-in-apache-spark-3-0.html
Upvotes: 2