Reputation: 5729
Hello, I have created a Spark DataFrame by reading from a Parquet file; it looks like this:
+-------+-----+-----+-----+-----+
| col-a|col-b|col-c|col-d|col-e|
+-------+-----+-----+-----+-----+
| .....|.....|.....|.....|.....|
+-------+-----+-----+-----+-----+
I want to group by col-a and col-b and then find out which groups have more than one unique row. For example, if we have the rows
a | b | x | y | z |
a | b | x | y | z |
c | b | m | n | p |
c | b | m | o | r |
I want to find the groups where count > 1 from
col-a | col-b | count |
a | b | 1 |
c | b | 2 |
I wrote this query
"SELECT `col-a`, `col-b`, count(DISTINCT (`col-c`, `col-d`, `col-e`)) AS count \
FROM df GROUP BY `col-a`, `col-b` WHERE count > 1"
This generates the error mismatched input 'WHERE' expecting {&lt;EOF&gt;, ';'}. Can anyone tell me what I am doing wrong here?
I am currently working around it like this:
df_2 = spark.sql("SELECT `col-a`, `col-b`, count(DISTINCT (`col-c`, `col-d`, `col-e`)) AS count FROM df GROUP BY `col-a`, `col-b`")
df_2.createOrReplaceTempView("df_2")
spark.sql("SELECT * FROM df_2 WHERE `count` > 1").show()
Upvotes: 1
Views: 960
Reputation: 39314
Filtering on an aggregate after a group by requires a having clause; where has to come before group by.
It should work like this. You may also be able to refer to the aggregate column by its alias or position instead of repeating the expression (not sure whether that works in Spark), which would keep it cleaner.
SELECT
`col-a`,
`col-b`,
count(DISTINCT (`col-c`, `col-d`, `col-e`)) AS count
FROM df
GROUP BY `col-a`, `col-b`
HAVING count(DISTINCT (`col-c`, `col-d`, `col-e`)) > 1
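
For completeness, the same filtering can be expressed with the DataFrame API instead of SQL. A minimal sketch, assuming the DataFrame is the one read from the Parquet file in the question (the path here is hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical path; use whatever Parquet source the question's df was read from.
df = spark.read.parquet("path/to/file.parquet")

# Group by col-a and col-b, count distinct (col-c, col-d, col-e) combinations,
# then keep only the groups whose distinct count exceeds 1.
result = (
    df.groupBy("col-a", "col-b")
      .agg(F.countDistinct("col-c", "col-d", "col-e").alias("count"))
      .filter(F.col("count") > 1)
)
result.show()

The .filter() on the aggregated column plays the role of the HAVING clause in the SQL version, so no intermediate temp view is needed.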
Upvotes: 2