Umesh Kacha

Reputation: 13686

Spark DataFrame duplicate column names while using mergeSchema

I have a huge Spark DataFrame which I create using the following statement:

val df = sqlContext.read.option("mergeSchema", "true").parquet("parquet/partitions/path")

Now when I try to rename or select a column on the above DataFrame, it fails saying ambiguous columns were found, with the following exception:

org.apache.spark.sql.AnalysisException: Reference 'Product_Type' is ambiguous, could be Product_Type#13, Product_Type#235

When I looked at the columns I found there are two, Product_Type and Product_type, which appear to be the same column differing only in letter case, created by the schema merge over time. I don't mind keeping the duplicate columns, but Spark's sqlContext for some reason doesn't like it.
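
A quick way to confirm which names collide only by case (a sketch, assuming the df defined above) is to group the schema's field names case-insensitively:

// Find field names in the merged schema that differ only by letter case
// (quick check; assumes the df created above).
val caseClashes = df.schema.fieldNames
  .groupBy(_.toLowerCase)
  .filter { case (_, names) => names.length > 1 }

caseClashes.values.foreach(names => println(names.mkString(", ")))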

I believe the spark.sql.caseSensitive config is true by default, so I don't know why it fails. I am using Spark 1.5.2 and I am new to Spark.

Upvotes: 1

Views: 5096

Answers (1)

Ramesh Maharjan

Reputation: 41987

By default, the spark.sql.caseSensitive property is false, so before your rename or select statement you should set the property to true:

sqlContext.sql("set spark.sql.caseSensitive=true")
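
For example, a minimal sketch of the whole flow (column names and path taken from the question; Spark 1.5.x API assumed):

// Enable case-sensitive analysis so Product_Type and Product_type
// are treated as distinct columns.
sqlContext.sql("set spark.sql.caseSensitive=true")

val df = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("parquet/partitions/path")

// With case sensitivity on, the two variants resolve to different columns,
// so they can be disambiguated, e.g. by dropping the stray one.
val deduped = df.drop("Product_type")

deduped.select("Product_Type").show(5)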

Upvotes: 5
