Umesh Kacha

Reputation: 13686

Spark DataFrame duplicate column names while using mergeSchema

I have a huge Spark DataFrame which I create using the following statement:

val df = sqlContext.read.option("mergeSchema", "true").parquet("parquet/partitions/path")

Now when I try to rename or select a column on the above DataFrame, it fails saying ambiguous columns were found, with the following exception:

org.apache.spark.sql.AnalysisException: Reference 'Product_Type' is ambiguous, could be Product_Type#13, Product_Type#235

When I looked at the columns I found there are two, Product_Type and Product_type, which appear to be the same column differing only in letter case, created by the schema merge over time. I don't mind keeping the duplicate columns, but Spark's sqlContext for some reason doesn't like it.
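
A quick way to confirm which names collide only by case (a sketch, assuming the df defined above) is to group the schema's field names case-insensitively:

// Find field names in the merged schema that differ only by letter case
// (quick check; assumes the df created above).
val caseClashes = df.schema.fieldNames
  .groupBy(_.toLowerCase)
  .filter { case (_, names) => names.length > 1 }

caseClashes.values.foreach(names => println(names.mkString(", ")))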

I believe the spark.sql.caseSensitive config is true by default, so I don't know why it fails. I am using Spark 1.5.2 and I am new to Spark.

Upvotes: 1

Views: 5096

Answers (1)

Ramesh Maharjan

Reputation: 41987

By default, the spark.sql.caseSensitive property is false, so before your rename or select statement you should set the property to true:

sqlContext.sql("set spark.sql.caseSensitive=true")
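
For example, a minimal sketch of the whole flow (column names and path taken from the question; Spark 1.5.x API assumed):

// Enable case-sensitive analysis so Product_Type and Product_type
// are treated as distinct columns.
sqlContext.sql("set spark.sql.caseSensitive=true")

val df = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("parquet/partitions/path")

// With case sensitivity on, the two variants resolve to different columns,
// so they can be disambiguated, e.g. by dropping the stray one.
val deduped = df.drop("Product_type")

deduped.select("Product_Type").show(5)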

Upvotes: 5
