Reputation: 4139
From some brief testing, it appears that the drop() function for PySpark DataFrames is not case sensitive, e.g.:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import sys
sparkSession = SparkSession.builder.appName("my-session").getOrCreate()
dff = sparkSession.createDataFrame([(10,123), (14,456), (16,678)], ["age", "AGE"])
>>> dff.show()
+---+---+
|age|AGE|
+---+---+
| 10|123|
| 14|456|
| 16|678|
+---+---+
>>> dff.drop("AGE")
DataFrame[]
>>> dff_dropped = dff.drop("AGE")
>>> dff_dropped.show()
++
||
++
||
||
||
++
"""
What I'd like to see here is:
+---+
|age|
+---+
| 10|
| 14|
| 16|
+---+
"""
Is there a way to drop DataFrame columns in a case-sensitive way? (I have seen some comments related to something like this in Spark JIRA discussions, but I was looking for something that applies only to the drop() operation in an ad hoc way, not a global / persistent setting.)
Upvotes: 3
Views: 2897
Reputation: 1708
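# Add this before calling drop()
sqlContext.sql("set spark.sql.caseSensitive=true")
You need to set spark.sql.caseSensitive to true when two columns have names that differ only in case. By default Spark resolves column names case-insensitively, so drop("AGE") matches both age and AGE.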
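If changing the session setting is not an option, one possible ad hoc workaround is to avoid name resolution altogether: rename the columns positionally to temporary names, drop by temporary name, then restore the surviving names. The sketch below is only an illustration; `drop_exact` and the `__col_N__` temporary names are hypothetical and not part of PySpark. It relies only on `DataFrame.columns`, `DataFrame.toDF` and `DataFrame.drop`, and assumes the temporary names do not clash with anything meaningful.

```python
def drop_exact(df, name):
    """Drop only the column(s) whose name equals `name` exactly (case-sensitive).

    `df` is expected to be a PySpark DataFrame; this helper is a hypothetical
    illustration, not a PySpark API.
    """
    original = list(df.columns)
    # Find the positions to remove using plain Python string equality,
    # which is always case-sensitive.
    doomed = [i for i, col in enumerate(original) if col == name]
    if not doomed:
        return df  # no exact match, so leave the DataFrame untouched

    # Temporary names that differ by more than case, so Spark's
    # case-insensitive resolution cannot confuse them.
    temp = ["__col_{}__".format(i) for i in range(len(original))]
    kept = [col for i, col in enumerate(original) if i not in doomed]

    renamed = df.toDF(*temp)                             # positional rename
    trimmed = renamed.drop(*[temp[i] for i in doomed])   # drop by temporary name
    return trimmed.toDF(*kept)                           # restore surviving names
```

With the DataFrame from the question, `drop_exact(dff, "AGE").show()` should keep only the `age` column, while the session-wide spark.sql.caseSensitive setting stays as it was.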
Upvotes: 6