Reputation: 1387
How do I do this in SQL/Hive?
columnA columnB columnC
100.10 50.60 30
100.10 50.60 30
100.10 50.60 20
100.10 70.80 40
Output should be:
columnA columnB No_of_distinct_colC
100.10 50.60 2
100.10 70.80 1
Query that I think is correct:
SELECT columnA,columnB,COUNT(distinct column C)
from table_name
group by columnA,columnB
Upvotes: 17
Views: 125925
Reputation: 129
the following code should work if you are using PySpark:
import pyspark.sql.functions as F
spark.sql('select * from table_name')\
.groupby(columnA, columnB)\
.agg(F.countDistinct('columnC') ).show()
Upvotes: 0
Reputation: 53
SELECT *
FROM
(
SELECT columnA, columnB, COUNT(DISTINCT column C) AS dis_col
FROM table_name
GROUP BY columnA, columnB
) A;
Upvotes: -3
Reputation: 2930
Yes, it is almost correct. But you have one simple mistake. Your column name is wrong inside COUNT.
SELECT columnA,columnB,COUNT(DISTINCT columnC) No_of_distinct_colC
from table_name
group by columnA,columnB
Upvotes: 37