Reputation: 1412
I am trying to select the row with a required value in a column using PySpark SQL, but it gives me an error.
Below is my session table:
session_id  status  session_desc
----------  ------  -----------------
session1    Old     first_description
session1    Active  last_description
session1    Old     next_description
session1    Active  inter_description
session2    Old     next_description
session2    Old     inter_description
Below is my Spark SQL query:
spark.sql("select session_id, (CASE WHEN status='Active' THEN session_desc END) AS session_description from session group by session_id").show()
But I am getting the error below:
org.apache.spark.sql.AnalysisException: expression 'session.status' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
I need the output below:
session_id  session_description
----------  -------------------
session1    last_description     # can be inter_description as well (I don't care)
session2    null
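For reference, here is a minimal setup that reproduces the problem (assuming the table is just a temp view built from the rows above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [("session1", "Old", "first_description"),
        ("session1", "Active", "last_description"),
        ("session1", "Old", "next_description"),
        ("session1", "Active", "inter_description"),
        ("session2", "Old", "next_description"),
        ("session2", "Old", "inter_description")]
df = spark.createDataFrame(data, ["session_id", "status", "session_desc"])
df.createOrReplaceTempView("session")  # the failing query selects from this view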
Upvotes: 2
Views: 1000
Reputation: 31540
Use your CASE expression inside a subquery, then GROUP BY in the outer query. The error occurs because status is referenced in the select list but is neither in the GROUP BY nor wrapped in an aggregate; moving the CASE into a subquery and aggregating its result with last() avoids that.
Example:
df.show()
#+----------+------+-----------------+
#|session_id|status| session_Desc|
#+----------+------+-----------------+
#| session1| Old|first_description|
#| session1|Active| last_description|
#| session1| Old| next_description|
#| session1|Active|inter_description|
#| session2| Old| next_description|
#| session2| Old|inter_description|
#+----------+------+-----------------+
spark.sql("select session_id,last(session_desc)session_description from (\
select session_id,case when status='Active' THEN session_desc END as session_desc from tmp)t \
group by session_id").\
show()
#+----------+-------------------+
#|session_id|session_description|
#+----------+-------------------+
#| session1| inter_description|
#| session2| null|
#+----------+-------------------+
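If you prefer the DataFrame API, the same idea works with when() plus an aggregate; a rough equivalent of the query above (using the df shown earlier, with ignorenulls=True so any Active description is picked and session2 stays null):
from pyspark.sql import functions as F

# null out non-Active descriptions, then keep one non-null value per session_id
df.groupBy("session_id") \
  .agg(F.last(F.when(F.col("status") == "Active", F.col("session_desc")),
              ignorenulls=True).alias("session_description")) \
  .show()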
Upvotes: 3