RushHour

Reputation: 613

Converting SQL query to Spark Dataframe structured data processing

I want to convert the following SQL query into a Spark DataFrame expression:

sqlContext.sql("SELECT d.dep_name,count(*) FROM employees e,department d WHERE e.dep_id = d.dep_id GROUP BY d.dep_name HAVING count(*) >= 2").show  

Output:

+---------+---+
| dep_name|_c1|
+---------+---+
|  FINANCE|  3|
|    AUDIT|  5|
|MARKETING|  6|
+---------+---+

I tried it with the following:

scala> finalEmployeesDf.as("df1").join(depDf.as("df2"), $"df1.dep_id" === $"df2.dep_id").select($"dep_name").groupBy($"dep_name").count.show()
+---------+-----+                                                               
| dep_name|count|
+---------+-----+
|  FINANCE|    3|
|    AUDIT|    5|
|MARKETING|    6|
+---------+-----+  

I know this isn't correct, because if a department has only a single entry it will still be listed in these results, but I want results displayed only when a department's count is at least 2. How can I achieve this? I tried googling but found no help for this case.

Upvotes: 1

Views: 7372

Answers (2)

antonpuz

Reputation: 3316

You have the grouping and aggregation parts wrong. You need to select the relevant columns, group by them, and aggregate the relevant ones. Here is untested code that represents the correct approach:

finalEmployeesDf.as("df1")
 .join(depDf.as("df2"), $"df1.dep_id" === $"df2.dep_id")
 .select($"dep_name")
 .groupBy($"dep_name")
 .agg(count($"dep_name").as("cnt"))
 .filter($"cnt" >= 2)
 .show()

A general suggestion is to break chained API calls across several lines; it makes the code much easier to read and understand.
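The `HAVING` clause boils down to three steps: group, count, then filter on the count. A minimal plain-Scala sketch of that same pipeline, outside Spark (the sample data below is hypothetical, not taken from the question's actual tables):

```scala
// Sketch of the join -> group -> count -> filter ("HAVING") pipeline
// using ordinary Scala collections and made-up sample rows.
case class Employee(depId: Int)
case class Department(depId: Int, depName: String)

val employees = Seq(Employee(1), Employee(1), Employee(2), Employee(3))
val departments = Seq(
  Department(1, "FINANCE"),
  Department(2, "AUDIT"),
  Department(3, "MARKETING")
)

// Join each employee to its department name, then count per name.
val counts: Map[String, Int] =
  employees
    .flatMap(e => departments.filter(_.depId == e.depId).map(_.depName))
    .groupBy(identity)
    .map { case (name, rows) => name -> rows.size }

// HAVING count(*) >= 2: keep only departments with at least two employees.
val result = counts.filter { case (_, cnt) => cnt >= 2 }
```

Here only `FINANCE` survives the filter, since it is the only department with two or more employees in the sample data; the Spark `groupBy`/`agg`/`filter` chain above applies the same logic at DataFrame scale.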

Upvotes: 1

Ged

Reputation: 18023

Try something like this:

DF.groupBy("x").agg(count("*").alias("cnt")).where($"cnt" > 2)

Upvotes: 1
