Reputation: 682
My table has a single Date column, and for each DepartmentID I want to derive two dates, one start_date and one end_date. The output will have two columns for the start and end date. I want to implement this using a SQL window function or the Spark DataFrame API.
Input
Employee ID  Date      DepartmentID  SupervisorID
10001        20130101  001           10009
10001        20130909  001           10019
10001        20131201  002           10018
10001        20140501  002           10017
10001        20141001  003           10015
10001        20141201  003           10014
Expected Output
Employee ID  DateStart  DateEnd   DepartmentID
10001        20130101   20131201  001
10001        20131201   20141001  002
10001        20141001   Null      003
Upvotes: 1
Views: 658
Reputation: 10406
Not sure why DateEnd for Department 003 is Null; it should be 20141201 from my understanding of your problem. Here is a Scala version using a group by and an aggregation: the min is the starting date, the max is the end date, and we set the end to null when there is only one row.
df
  .groupBy("DepartmentID", "Employee ID")
  .agg(min($"Date") as "DateStart", max($"Date") as "DateEnd", count("*") as "count")
  .withColumn("DateEnd", when($"count" > 1, $"DateEnd"))
  .drop("count")
  .show(false)
+------------+-----------+---------+--------+
|DepartmentID|Employee ID|DateStart|DateEnd |
+------------+-----------+---------+--------+
|002 |10001 |20131201 |20140501|
|003 |10001 |20141001 |20141201|
|001 |10001 |20130101 |20130909|
+------------+-----------+---------+--------+
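For readers without a Spark shell handy, the same min/max/count grouping can be sketched in plain Python on the question's sample data (this simulates the logic of the aggregation above, it is not Spark itself; the row tuples are copied from the input table):

```python
from collections import defaultdict

# Sample rows from the question: (EmployeeID, Date, DepartmentID)
rows = [
    ("10001", "20130101", "001"),
    ("10001", "20130909", "001"),
    ("10001", "20131201", "002"),
    ("10001", "20140501", "002"),
    ("10001", "20141001", "003"),
    ("10001", "20141201", "003"),
]

# Group by (DepartmentID, EmployeeID) and collect the dates,
# mirroring groupBy(...).agg(min, max, count).
groups = defaultdict(list)
for emp, date, dept in rows:
    groups[(dept, emp)].append(date)

result = []
for (dept, emp), dates in sorted(groups.items()):
    start = min(dates)
    # DateEnd is null (None) unless the group has more than one row,
    # mirroring when('count > 1, 'DateEnd).
    end = max(dates) if len(dates) > 1 else None
    result.append((dept, emp, start, end))
```

On this data `result` reproduces the three rows shown in the table above, with each department's DateEnd being its own last date rather than the next department's start.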
Upvotes: 0
Reputation: 4234
Suppose you register the DataFrame as a temporary view named tmp; then the following SQL gives the expected result. Note the window is partitioned by EmployeeID so the lead does not cross employees, and the derived table needs an alias:
select EmployeeID, DateStart,
       lead(DateStart) over (partition by EmployeeID order by DateStart) as DateEnd,
       DepartmentID
from (select EmployeeID, min(Date) as DateStart, DepartmentID
      from tmp
      group by EmployeeID, DepartmentID) t
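The lead-over-min logic of this query can be checked in plain Python on the sample data (a sketch of the window semantics, not Spark SQL itself; rows are copied from the input table):

```python
from collections import defaultdict

# Sample rows from the question: (EmployeeID, Date, DepartmentID)
rows = [
    ("10001", "20130101", "001"),
    ("10001", "20130909", "001"),
    ("10001", "20131201", "002"),
    ("10001", "20140501", "002"),
    ("10001", "20141001", "003"),
    ("10001", "20141201", "003"),
]

# Inner query: min(Date) per (EmployeeID, DepartmentID).
starts = {}
for emp, date, dept in rows:
    key = (emp, dept)
    starts[key] = min(starts.get(key, date), date)

# Outer query: lead(DateStart) over (partition by EmployeeID order by DateStart).
per_emp = defaultdict(list)
for (emp, dept), start in starts.items():
    per_emp[emp].append((start, dept))

result = []
for emp, items in per_emp.items():
    items.sort()  # order by DateStart within the partition
    for i, (start, dept) in enumerate(items):
        # lead() returns the next row's DateStart, or None on the last row
        end = items[i + 1][0] if i + 1 < len(items) else None
        result.append((emp, start, end, dept))
```

On this data `result` matches the expected output exactly, including the Null (None) DateEnd for DepartmentID 003.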
Upvotes: 1