Reputation: 41
I have a PySpark DataFrame df:
import pandas as pd
data = {'Passenger-Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},'Age': {0: 22, 1: 38, 2: 26, 3: 35, 4: 35}}
df_pd = pd.DataFrame(data, columns=data.keys())
df = spark.createDataFrame(df_pd)
+------------+---+
|Passenger-Id|Age|
+------------+---+
|           1| 22|
|           2| 38|
|           3| 26|
|           4| 35|
|           5| 35|
+------------+---+
This works:
df.filter(df.Age == 22).show()
But the following doesn't work, because of the hyphen (-) in the column name:
df.filter(df.Passenger-Id == 2).show()
AttributeError: 'DataFrame' object has no attribute 'Passenger'
I'm facing the same issue in Spark SQL too:
spark.sql("SELECT Passenger-Id FROM AutoMobile").show()
spark.sql("SELECT automobile.Passenger-Id FROM AutoMobile").show()
I get the following error:
AnalysisException: cannot resolve 'Passenger' given input columns: [automobile.Age, automobile.Passenger-Id]
I tried wrapping the column name in single quotes, as advised in some sources, but now the query just prints the column name itself as a string on every row:
spark.sql("SELECT 'Passenger-Id' FROM AutoMobile").show()
+------------+
|Passenger-Id|
+------------+
|Passenger-Id|
|Passenger-Id|
|Passenger-Id|
|Passenger-Id|
|Passenger-Id|
+------------+
Upvotes: 0
Views: 2934
Reputation: 509
The following worked for me: double quotes inside single quotes.
import pyspark.sql.functions as F
df.filter(F.col('"Passenger-Id"') == 2).show()
Upvotes: -1
Reputation: 20445
Since you have a hyphen in the column name, I suggest using the col() function from pyspark.sql.functions:
import pyspark.sql.functions as F
df.filter(F.col('Passenger-Id') == 2).show()
Here is the result:
+------------+---+
|Passenger-Id|Age|
+------------+---+
|           2| 38|
+------------+---+
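As for why the attribute form in the question fails: Python itself reads df.Passenger-Id as a subtraction, df.Passenger minus a variable named Id, before Spark ever sees it. The sketch below only inspects how Python parses the two expressions with the standard ast module, so it needs no Spark session. The bracket form df["Passenger-Id"] is shown as an alternative to col() that keeps the whole name as one string key.

```python
import ast

# Parse the failing expression from the question without executing it.
failing = ast.parse("df.Passenger-Id == 2", mode="eval").body

# The left side of the comparison is a subtraction: (df.Passenger) - (Id).
left = failing.left
print(type(left).__name__, type(left.op).__name__)  # BinOp Sub
print(left.left.attr)                                # Passenger
print(left.right.id)                                 # Id

# Bracket indexing passes the full column name as a single string,
# so the hyphen is never treated as an operator.
working = ast.parse('df["Passenger-Id"] == 2', mode="eval").body
print(type(working.left).__name__)                   # Subscript
```

This matches the AttributeError in the question: the lookup that fails is for an attribute called Passenger, not Passenger-Id.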
Now, for SQL syntax, you need to wrap the column name in backticks (`), not single quotes, as shown below:
df.createOrReplaceTempView("AutoMobile")
spark.sql("SELECT * FROM AutoMobile where `Passenger-Id`=2").show()
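The difference between the two quote styles can be reproduced without a cluster. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module rather than Spark SQL, only because SQLite ships with Python and happens to treat single quotes and backticks the same way for this case. The table is a hand-built copy of the question's data, and other quoting rules (double quotes, for example) differ between the two engines.

```python
import sqlite3

# Stand-in for the Spark temp view: an in-memory SQLite table.
# (Assumption: SQLite is used purely for illustration, not Spark SQL.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AutoMobile (`Passenger-Id` INTEGER, Age INTEGER)")
conn.executemany(
    "INSERT INTO AutoMobile VALUES (?, ?)",
    [(1, 22), (2, 38), (3, 26), (4, 35), (5, 35)],
)

# Unquoted: the hyphen is parsed as subtraction, so the lookup fails.
try:
    conn.execute("SELECT Passenger-Id FROM AutoMobile")
    unquoted_error = None
except sqlite3.OperationalError as exc:
    unquoted_error = str(exc)

# Single quotes: a string literal, returned once per row.
literal_rows = conn.execute("SELECT 'Passenger-Id' FROM AutoMobile").fetchall()

# Backticks: a quoted identifier, so the real column is read.
column_rows = conn.execute(
    "SELECT `Passenger-Id`, Age FROM AutoMobile WHERE `Passenger-Id` = 2"
).fetchall()

print(literal_rows[0])  # ('Passenger-Id',)
print(column_rows)      # [(2, 38)]
```

The three queries mirror the three attempts in the question: the unquoted name errors out, the single-quoted name echoes a constant string, and the backtick-quoted name returns the actual row.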
Upvotes: 6