codekoriko

Reputation: 870

pyspark: read partitioned parquet "my_file.parquet/col1=NOW" string value replaced by <current_time> on read()

With pyspark 3.1.1 on wsl Debian 10

When reading a parquet file partitioned on a column containing the string NOW, the string is replaced by the current timestamp at the moment read() is executed. I suppose the NOW string is interpreted as now().

# steps to reproduce
df = spark.createDataFrame(data=[("NOW", 1), ("TEST", 2)], schema=["col1", "id"])
df.write.partitionBy("col1").parquet("test/test.parquet")
>>> /home/test/test.parquet/col1=NOW

df_loaded = spark.read.option(
    "basePath",
    "test/test.parquet",
).parquet("test/test.parquet/col1=*")
df_loaded.show(truncate=False)
>>> 
+---+--------------------------+
|id |col1                      |
+---+--------------------------+
|2  |TEST                      |
|1  |2021-04-18 14:36:46.532273|
+---+--------------------------+

Is this a bug or normal pyspark behaviour? If the latter, is there a Spark configuration option to avoid it?

Upvotes: 1

Views: 55

Answers (1)

mck

Reputation: 42352

I suspect that's an expected feature, but I'm not sure where it's documented. Anyway, if you want to keep the column as a string column, you can provide a schema when reading the parquet file:

df = spark.read.schema("id long, col1 string").parquet("test/test.parquet")

df.show()
+---+----+
| id|col1|
+---+----+
|  1| NOW|
|  2|TEST|
+---+----+

Upvotes: 1
