suriyanto

Reputation: 1095

Avoid losing data type for the partitioned data when writing from Spark

I have a dataframe like below.

itemName, itemCategory
Name1, C0
Name2, C1
Name3, C0

I would like to save this dataframe as a partitioned parquet file:

df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)

For this dataframe, when I read the data back, itemCategory will have the String data type.

However at times, I have dataframe from other tenants as below.

itemName, itemCategory
Name1, 0
Name2, 1
Name3, 0

In this case, after the data is written partitioned and read back, the resulting dataframe will have Int as the data type of itemCategory.

Parquet files have metadata that describes the data types. How can I specify the data type for the partition column so it will be read back as String instead of Int?
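For reference, a minimal repro of the behavior (the path is illustrative): even when the column is a String in the source dataframe, numeric-looking partition values come back as Int.

import spark.implicits._

val path = "/tmp/test/itemsByCategory"

// itemCategory holds numeric-looking strings
val df = Seq(("Name1", "0"), ("Name2", "1"), ("Name3", "0"))
  .toDF("itemName", "itemCategory")

df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)

// Spark infers the partition column type from the directory names
spark.read.parquet(path).printSchema()
// root
//  |-- itemName: string (nullable = true)
//  |-- itemCategory: integer (nullable = true)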

Upvotes: 8

Views: 7206

Answers (3)

Dvir Yitzchaki

Reputation: 576

Read it with a schema:

import spark.implicits._
val path = "/tmp/test/input"
val source = Seq(("Name1", "0"), ("Name2", "1"), ("Name3", "0")).toDF("itemName", "itemCategory")
source.write.partitionBy("itemCategory").parquet(path)
spark.read.schema(source.schema).parquet(path).printSchema() 
// will print 
// root
// |-- itemName: string (nullable = true)
// |-- itemCategory: string (nullable = true)

See https://www.zepl.com/viewer/notebooks/bm90ZTovL2R2aXJ0ekBnbWFpbC5jb20vMzEzZGE2ZmZjZjY0NGRiZjk2MzdlZDE4NjEzOWJlZWYvbm90ZS5qc29u

Upvotes: 0

David Schuler

Reputation: 1031

If you set "spark.sql.sources.partitionColumnTypeInference.enabled" to "false", Spark will read all partition columns as Strings instead of inferring their types.

In Spark 2.0 or greater, you can set it like this:

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

In Spark 1.6, like this:

sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

The downside is you have to do this each time you read the data, but at least it works.
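For example, with the data from the question (a minimal sketch; the path is illustrative and assumes the partitioned layout written above):

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

// With inference disabled, the partition column is read as a string
spark.read.parquet(path).printSchema()
// root
//  |-- itemName: string (nullable = true)
//  |-- itemCategory: string (nullable = true)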

Upvotes: 9

Shaido

Reputation: 28322

As you partition by the itemCategory column, this data is stored in the file structure (the directory names) and not in the actual parquet files. When reading, Spark infers the data type from the values: if all values are integers, the column type will be int.

One simple solution would be to cast the column to StringType after reading the data:

import spark.implicits._
import org.apache.spark.sql.types.StringType

df.withColumn("itemCategory", $"itemCategory".cast(StringType))

Another option is to duplicate the column. One of the columns is then used for the partitioning and hence stored in the file structure, while the duplicate is saved normally inside the parquet file and keeps its data type. To make a duplicate, simply use:

df.withColumn("itemCategoryCopy", $"itemCategory")
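Put together, a sketch of the round trip might look like this (the path is illustrative and df is assumed to be the dataframe from the question):

import spark.implicits._

val path = "/tmp/test/duplicated"

// Partition by the copy so the original column is stored inside
// the parquet files with its real data type
df.withColumn("itemCategoryCopy", $"itemCategory")
  .write.mode("overwrite").partitionBy("itemCategoryCopy").parquet(path)

// itemCategory keeps its original type; the inferred partition
// column itemCategoryCopy can simply be dropped after reading
spark.read.parquet(path).drop("itemCategoryCopy").printSchema()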

Upvotes: 0
