Reputation: 613
I want to convert the following query to a Spark DataFrame:
sqlContext.sql("SELECT d.dep_name,count(*) FROM employees e,department d WHERE e.dep_id = d.dep_id GROUP BY d.dep_name HAVING count(*) >= 2").show
Output:
+---------+---+
| dep_name|_c1|
+---------+---+
|  FINANCE|  3|
|    AUDIT|  5|
|MARKETING|  6|
+---------+---+
I tried it using the query below:
scala> finalEmployeesDf.as("df1").join(depDf.as("df2"), $"df1.dep_id" === $"df2.dep_id").select($"dep_name").groupBy($"dep_name").count.show()
+---------+-----+
| dep_name|count|
+---------+-----+
|  FINANCE|    3|
|    AUDIT|    5|
|MARKETING|    6|
+---------+-----+
I know this isn't correct, because if a department has only a single entry it will still be listed in these results, whereas I want a department shown only when its count is at least 2, as in the HAVING clause. How can I achieve this? I tried googling, but found no help for this case.
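For reference, a minimal sketch of the assumed setup (the table and column names come from the query; the sample rows and the Spark 2.x-style session are assumptions, not part of the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("having-example").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample rows; only dep_id and dep_name appear in the original query
val employees = Seq((1, "Alice"), (1, "Bob"), (2, "Carol")).toDF("dep_id", "emp_name")
val department = Seq((1, "FINANCE"), (2, "AUDIT"), (3, "MARKETING")).toDF("dep_id", "dep_name")
employees.createOrReplaceTempView("employees")
department.createOrReplaceTempView("department")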
Upvotes: 1
Views: 7372
Reputation: 3316
You have the group and aggregate parts wrong. You need to select the relevant columns, group by them, and aggregate on the relevant ones. Here is untested code that represents the correct approach:
finalEmployeesDf.as("df1")
.join(depDf.as("df2"), $"df1.dep_id" === $"df2.dep_id")
.select($"dep_name")
.groupBy($"dep_name")
.agg(count($"dep_name").as("cnt"))
.filter($"cnt" > 2)
.show()
A general suggestion: break chained API calls across several lines, as above; it makes the code much easier to read and understand.
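Alternatively, a minimal (untested) tweak that keeps the groupBy().count() form from the question is to filter on the column that .count generates, which is literally named count:

finalEmployeesDf.as("df1")
  .join(depDf.as("df2"), $"df1.dep_id" === $"df2.dep_id")
  .groupBy($"dep_name")
  .count()                // produces a column named "count"
  .filter($"count" >= 2)  // HAVING count(*) >= 2
  .show()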
Upvotes: 1
Reputation: 18023
Try something like this:
DF.groupBy("x").agg(count("*").alias("cnt")).where($"cnt" > 2)
Upvotes: 1