Reputation: 5729
Hello, I have created a Spark DataFrame by reading from a Parquet file; it looks like this:
+-------+-----+-----+-----+-----+
| col-a|col-b|col-c|col-d|col-e|
+-------+-----+-----+-----+-----+
| .....|.....|.....|.....|.....|
+-------+-----+-----+-----+-----+
I want to group by col-a and col-b and then find out which groups have more than one unique row. For example, if we have the rows
a | b | x | y | z |
a | b | x | y | z |
c | b | m | n | p |
c | b | m | o | r |
I want to find the groups where count > 1 from
col-a | col-b | count |
a | b | 1 |
c | b | 2 |
I wrote this query
"SELECT `col-a`, `col-b`, count(DISTINCT (`col-c`, `col-d`, `col-e`)) AS count \
FROM df GROUP BY `col-a`, `col-b` WHERE count > 1"
This generates the error mismatched input 'WHERE' expecting {&lt;EOF&gt;, ';'}. Can anyone tell me what I am doing wrong here?
I am currently working around it like this:
df_2 = spark.sql("SELECT `col-a`, `col-b`, count(DISTINCT (`col-c`, `col-d`, `col-e`)) AS count FROM df GROUP BY `col-a`, `col-b`")
df_2.createOrReplaceTempView("df_2")
spark.sql("SELECT * FROM df_2 WHERE `count` > 1").show()
Upvotes: 1
Views: 960
Reputation: 39314
Filtering on an aggregate after a group by requires a having clause; where has to come before group by.
It should work like this. You may also be able to refer to the aggregate column by its alias or position instead of repeating the expression (not sure whether that works in Spark), which would keep it cleaner.
SELECT
`col-a`,
`col-b`,
count(DISTINCT (`col-c`, `col-d`, `col-e`)) AS count
FROM df
GROUP BY `col-a`, `col-b`
HAVING count(DISTINCT (`col-c`, `col-d`, `col-e`)) > 1
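
For completeness, the same filtering can be expressed with the DataFrame API instead of SQL. A minimal sketch, assuming the DataFrame is the one read from the Parquet file in the question (the path here is hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical path; use whatever Parquet source the question's df was read from.
df = spark.read.parquet("path/to/file.parquet")

# Group by col-a and col-b, count distinct (col-c, col-d, col-e) combinations,
# then keep only the groups whose distinct count exceeds 1.
result = (
    df.groupBy("col-a", "col-b")
      .agg(F.countDistinct("col-c", "col-d", "col-e").alias("count"))
      .filter(F.col("count") > 1)
)
result.show()

The .filter() on the aggregated column plays the role of the HAVING clause in the SQL version, so no intermediate temp view is needed.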
Upvotes: 2