I am using the following view (dfview):
+----+-----+-------+
|roll|grade|subject|
+----+-----+-------+
| 1| A| Maths|
| 1| A| Chem|
| 1| B| Phy|
| 2| A| Maths|
| 2| B| Chem|
| 2| B| Phy|
+----+-----+-------+
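For context, a minimal sketch (assuming a local SparkSession) that recreates this view from the data above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data shown above and register it as a temp view
df = spark.createDataFrame(
    [(1, "A", "Maths"), (1, "A", "Chem"), (1, "B", "Phy"),
     (2, "A", "Maths"), (2, "B", "Chem"), (2, "B", "Phy")],
    ["roll", "grade", "subject"],
)
df.createOrReplaceTempView("dfview")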
I am running the following query in PySpark:
spark.sql('''
    select
        roll,
        grouping(roll),
        grouping(grade),
        grouping(subject),
        count(*)
    from
        dfview
    group by cube(roll, grade, subject)
    having
        (grouping(roll) = 0
         and count(*) > 1)
''').show()
I am getting this error:
AnalysisException: Column of grouping (roll#19400) can't be found in grouping columns roll#19572,grade#19573,subject#19574
I think this is due to the column roll being introduced in the SELECT list. This sort of query does not produce any error in Oracle 19c, though. Is it a shortcoming of PySpark 3.1.2, which I am using? I highlighted a similar discrepancy between PySpark and Oracle SQL behaviour in relation to grouping in the following question.
You need to assign aliases to the columns that are referenced in the having clause:
spark.sql(""" select
roll,
grouping(roll) as gr_roll,
grouping(grade),
grouping(subject),
count(*) as ct
from
dfview
group by cube(roll,grade,subject)
having
gr_roll = 0
and ct > 1""").show()
# +----+-------+---------------+-----------------+---+
# |roll|gr_roll|grouping(grade)|grouping(subject)| ct|
# +----+-------+---------------+-----------------+---+
# | 1| 0| 1| 1| 3|
# | 2| 0| 0| 1| 2|
# | 1| 0| 0| 1| 2|
# | 2| 0| 1| 1| 3|
# +----+-------+---------------+-----------------+---+
Alternatively, if you do not want to assign aliases, reference the generated column names with backticks, like `grouping(roll)` and `count(1)`, as shown in the sketch below.
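For example, a sketch of the same query using backticked references instead of aliases (Spark names the output column of count(*) as count(1), hence the backticked `count(1)`):

spark.sql("""
    select
        roll,
        grouping(roll),
        grouping(grade),
        grouping(subject),
        count(*)
    from
        dfview
    group by cube(roll, grade, subject)
    having
        `grouping(roll)` = 0
        and `count(1)` > 1
""").show()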