Anirban Chakraborty
Anirban Chakraborty

Reputation: 781

grouping() shows behaviour in pyspark not consistent with oracle

I am using the following view (dfview) -

+----+-----+-------+
|roll|grade|subject|
+----+-----+-------+
|   1|    A|  Maths|
|   1|    A|   Chem|
|   1|    B|    Phy|
|   2|    A|  Maths|
|   2|    B|   Chem|
|   2|    B|    Phy|
+----+-----+-------+

I am running the following query in pyspark -

spark.sql('''
          select
          roll,
            grouping(roll),
            grouping(grade),
            grouping(subject),
            count(*)
          from
            dfview
          group by cube(roll,grade,subject)
          having
            (grouping(roll) = 0
            and count(*) > 1)
''').show()

I am getting an error -

AnalysisException: Column of grouping (roll#19400) can't be found in grouping columns roll#19572,grade#19573,subject#19574

I think this is due to the introduction of the column roll in the select view portion. This sort of query does not produce any error in oracle 19c though. Is it a shortcoming of pyspark 3.1.2, that I am using. A similar discrepancy of pyspark with oracle sql behaviour in relation to grouping, I highlighted in the following question.

Upvotes: 1

Views: 80

Answers (1)

AdibP
AdibP

Reputation: 2939

You need to assign aliases to the columns that are referenced in the having clause

spark.sql(""" select
            roll,
            grouping(roll) as gr_roll,
            grouping(grade),
            grouping(subject),
            count(*) as ct
          from
            dfview
          group by cube(roll,grade,subject)
          having
            gr_roll = 0
            and ct > 1""").show()

# +----+-------+---------------+-----------------+---+
# |roll|gr_roll|grouping(grade)|grouping(subject)| ct|
# +----+-------+---------------+-----------------+---+
# |   1|      0|              1|                1|  3|
# |   2|      0|              0|                1|  2|
# |   1|      0|              0|                1|  2|
# |   2|      0|              1|                1|  3|
# +----+-------+---------------+-----------------+---+

or use backticks like `grouping(roll)` and `count(1)` if you do not want to use aliases

Upvotes: 1

Related Questions