Reputation: 682
My table has a single Date column, and for each DepartmentID I want to derive two dates, one start_date and one end_date. The output will have two columns for the start and end date. I want to implement this using a SQL window function or the Spark DataFrame API.
Input
Employee ID  Date      DepartmentID  SupervisorID
10001        20130101  001           10009
10001        20130909  001           10019
10001        20131201  002           10018
10001        20140501  002           10017
10001        20141001  003           10015
10001        20141201  003           10014
Expected Output
Employee ID  DateStart  DateEnd   DepartmentID
10001        20130101   20131201  001
10001        20131201   20141001  002
10001        20141001   Null      003
Upvotes: 1
Views: 658
Reputation: 10406
Not sure why DateEnd for Department 003 is Null; it should be 20141201 from my understanding of your problem. Here is a Scala version using a group by and an aggregation: the min is the starting date, the max is the end date, and we set the end to null when there is only one row.
df
  .groupBy("DepartmentID", "Employee ID")
  .agg(min($"Date") as "DateStart", max($"Date") as "DateEnd", count("*") as "count")
  .withColumn("DateEnd", when($"count" > 1, $"DateEnd"))
  .drop("count")
  .show(false)
+------------+-----------+---------+--------+
|DepartmentID|Employee ID|DateStart|DateEnd |
+------------+-----------+---------+--------+
|002 |10001 |20131201 |20140501|
|003 |10001 |20141001 |20141201|
|001 |10001 |20130101 |20130909|
+------------+-----------+---------+--------+
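For readers without a Spark shell handy, the same min/max/count grouping can be sketched in plain Python on the question's sample data (this simulates the logic of the aggregation above, it is not Spark itself; the row tuples are copied from the input table):

```python
from collections import defaultdict

# Sample rows from the question: (EmployeeID, Date, DepartmentID)
rows = [
    ("10001", "20130101", "001"),
    ("10001", "20130909", "001"),
    ("10001", "20131201", "002"),
    ("10001", "20140501", "002"),
    ("10001", "20141001", "003"),
    ("10001", "20141201", "003"),
]

# Group by (DepartmentID, EmployeeID) and collect the dates,
# mirroring groupBy(...).agg(min, max, count).
groups = defaultdict(list)
for emp, date, dept in rows:
    groups[(dept, emp)].append(date)

result = []
for (dept, emp), dates in sorted(groups.items()):
    start = min(dates)
    # DateEnd is null (None) unless the group has more than one row,
    # mirroring when('count > 1, 'DateEnd).
    end = max(dates) if len(dates) > 1 else None
    result.append((dept, emp, start, end))
```

On this data `result` reproduces the three rows shown in the table above, with each department's DateEnd being its own last date rather than the next department's start.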
Upvotes: 0
Reputation: 4234
Suppose you register the DataFrame as a temporary view named tmp; then the following SQL gives the expected result. Note the window is partitioned by EmployeeID so the lead does not cross employees, and the derived table needs an alias:
select EmployeeID, DateStart,
       lead(DateStart) over (partition by EmployeeID order by DateStart) as DateEnd,
       DepartmentID
from (select EmployeeID, min(Date) as DateStart, DepartmentID
      from tmp
      group by EmployeeID, DepartmentID) t
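The lead-over-min logic of this query can be checked in plain Python on the sample data (a sketch of the window semantics, not Spark SQL itself; rows are copied from the input table):

```python
from collections import defaultdict

# Sample rows from the question: (EmployeeID, Date, DepartmentID)
rows = [
    ("10001", "20130101", "001"),
    ("10001", "20130909", "001"),
    ("10001", "20131201", "002"),
    ("10001", "20140501", "002"),
    ("10001", "20141001", "003"),
    ("10001", "20141201", "003"),
]

# Inner query: min(Date) per (EmployeeID, DepartmentID).
starts = {}
for emp, date, dept in rows:
    key = (emp, dept)
    starts[key] = min(starts.get(key, date), date)

# Outer query: lead(DateStart) over (partition by EmployeeID order by DateStart).
per_emp = defaultdict(list)
for (emp, dept), start in starts.items():
    per_emp[emp].append((start, dept))

result = []
for emp, items in per_emp.items():
    items.sort()  # order by DateStart within the partition
    for i, (start, dept) in enumerate(items):
        # lead() returns the next row's DateStart, or None on the last row
        end = items[i + 1][0] if i + 1 < len(items) else None
        result.append((emp, start, end, dept))
```

On this data `result` matches the expected output exactly, including the Null (None) DateEnd for DepartmentID 003.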
Upvotes: 1