RushHour

Reputation: 613

Converting SQL query to Spark Dataframe structured data processing

I want to convert the following SQL query into a Spark DataFrame expression:

sqlContext.sql("SELECT d.dep_name,count(*) FROM employees e,department d WHERE e.dep_id = d.dep_id GROUP BY d.dep_name HAVING count(*) >= 2").show  

Output:

+---------+---+
| dep_name|_c1|
+---------+---+
|  FINANCE|  3|
|    AUDIT|  5|
|MARKETING|  6|
+---------+---+

I tried it with the following:

scala> finalEmployeesDf.as("df1").join(depDf.as("df2"), $"df1.dep_id" === $"df2.dep_id").select($"dep_name").groupBy($"dep_name").count.show()
+---------+-----+                                                               
| dep_name|count|
+---------+-----+
|  FINANCE|    3|
|    AUDIT|    5|
|MARKETING|    6|
+---------+-----+  

I know this isn't correct, because if a department has only a single entry it will still be listed in these results, but I want results displayed only when a department's count is at least 2. How can I achieve this? I tried googling but found no help for this case.

Upvotes: 1

Views: 7372

Answers (2)

antonpuz

Reputation: 3316

You have the grouping and aggregation parts wrong. You need to select the relevant columns, group by them, and aggregate the relevant ones. Here is untested code that represents the correct approach:

finalEmployeesDf.as("df1")
 .join(depDf.as("df2"), $"df1.dep_id" === $"df2.dep_id")
 .select($"dep_name")
 .groupBy($"dep_name")
 .agg(count($"dep_name").as("cnt"))
 .filter($"cnt" >= 2)
 .show()

A general suggestion is to break chained API calls across several lines; it makes the code much easier to read and understand.
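The `HAVING` clause boils down to three steps: group, count, then filter on the count. A minimal plain-Scala sketch of that same pipeline, outside Spark (the sample data below is hypothetical, not taken from the question's actual tables):

```scala
// Sketch of the join -> group -> count -> filter ("HAVING") pipeline
// using ordinary Scala collections and made-up sample rows.
case class Employee(depId: Int)
case class Department(depId: Int, depName: String)

val employees = Seq(Employee(1), Employee(1), Employee(2), Employee(3))
val departments = Seq(
  Department(1, "FINANCE"),
  Department(2, "AUDIT"),
  Department(3, "MARKETING")
)

// Join each employee to its department name, then count per name.
val counts: Map[String, Int] =
  employees
    .flatMap(e => departments.filter(_.depId == e.depId).map(_.depName))
    .groupBy(identity)
    .map { case (name, rows) => name -> rows.size }

// HAVING count(*) >= 2: keep only departments with at least two employees.
val result = counts.filter { case (_, cnt) => cnt >= 2 }
```

Here only `FINANCE` survives the filter, since it is the only department with two or more employees in the sample data; the Spark `groupBy`/`agg`/`filter` chain above applies the same logic at DataFrame scale.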

Upvotes: 1

Ged

Reputation: 18023

Try something like this:

DF.groupBy("x").agg(count("*").alias("cnt")).where($"cnt" > 2)

Upvotes: 1
