Reputation: 83
I have a dataframe in PySpark as below:

ID Name add date_from date_end
1 aaa yyyyyy 20-01-2018 30-01-2018
2 bbb ffffff 02-11-2018 15-11-2018
but I am looking to get the output below, with one row per day from date_from through date_end:

ID Name add date_from date_end
1 aaa yyyyyy 20-01-2018 30-01-2018
1 aaa yyyyyy 21-01-2018 30-01-2018
1 aaa yyyyyy 22-01-2018 30-01-2018
1 aaa yyyyyy 23-01-2018 30-01-2018
1 aaa yyyyyy 24-01-2018 30-01-2018
1 aaa yyyyyy 25-01-2018 30-01-2018
1 aaa yyyyyy 26-01-2018 30-01-2018
1 aaa yyyyyy 27-01-2018 30-01-2018
1 aaa yyyyyy 28-01-2018 30-01-2018
1 aaa yyyyyy 29-01-2018 30-01-2018
1 aaa yyyyyy 30-01-2018 30-01-2018
2 bbb ffffff 02-11-2018 15-11-2018
2 bbb ffffff 03-11-2018 15-11-2018
2 bbb ffffff 04-11-2018 15-11-2018
2 bbb ffffff 05-11-2018 15-11-2018
2 bbb ffffff 06-11-2018 15-11-2018
2 bbb ffffff 07-11-2018 15-11-2018
2 bbb ffffff 08-11-2018 15-11-2018
2 bbb ffffff 09-11-2018 15-11-2018
2 bbb ffffff 10-11-2018 15-11-2018
2 bbb ffffff 11-11-2018 15-11-2018
2 bbb ffffff 12-11-2018 15-11-2018
2 bbb ffffff 13-11-2018 15-11-2018
2 bbb ffffff 14-11-2018 15-11-2018
2 bbb ffffff 15-11-2018 15-11-2018
Upvotes: 1
Views: 1811
Reputation: 3
To do the same in an Azure Data Factory data flow, follow the logic below.

Add a derived column transformation to the data flow. This calculates the number of months between the start and end dates:

derived_col1 = toInteger(round(monthsBetween(end_date, st_date)))

Add another derived column transformation. This returns an array holding all the months between the two dates:

derived_col2 = mapLoop(derived_col1 + 1, toDate(addMonths(st_date, toInteger(#index) - 1)))

Add one more derived column transformation. This is similar to an explode operation and returns one row per array element:

month_column = unfold(derived_col2)

All of this could fit into a single derived column transformation; it is written as multiple steps only to make the logic easier to follow.
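For comparison, here is a minimal PySpark sketch of the same build-array-then-unfold idea (assuming Spark 2.4+; the ID/st_date/end_date input below is hypothetical): sequence() plays the role of mapLoop, and explode() the role of unfold.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input with proper date columns named as in the expressions above
df2 = spark.createDataFrame([(1, "2018-01-20", "2018-03-30")],
                            ["ID", "st_date", "end_date"]) \
    .select("ID", F.to_date("st_date").alias("st_date"),
            F.to_date("end_date").alias("end_date"))

# sequence() builds the array of month starts (the mapLoop step);
# explode() turns the array into one row per element (the unfold step)
df2.withColumn("month_column",
               F.explode(F.expr("sequence(st_date, end_date, interval 1 month)"))) \
   .show()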
Upvotes: 0
Reputation: 784
Try this out:
# Recreate the sample data from the question
a = [(1,'aaa','yyyyyy','20-01-2018','30-01-2018'),
     (2,'bbb','ffffff','02-11-2018','15-11-2018')]
df = spark.createDataFrame(a, ["ID","Name","add","date_from","date_end"])
df.show()
+---+----+------+----------+----------+
| ID|Name| add| date_from| date_end|
+---+----+------+----------+----------+
| 1| aaa|yyyyyy|20-01-2018|30-01-2018|
| 2| bbb|ffffff|02-11-2018|15-11-2018|
+---+----+------+----------+----------+
df.createOrReplaceTempView("temp")
result = spark.sql("""
select t.ID,
       t.Name,
       t.add,
       -- shift date_from forward by the row's position in the exploded sequence
       date_format(date_add(to_date(t.date_from,'dd-MM-yyyy'), pe.i), 'dd-MM-yyyy') as date_from,
       t.date_end
from temp t
-- space(n) produces a string of n blanks, so splitting it on ' ' yields
-- n + 1 tokens; posexplode numbers them 0..n, where
-- n = datediff(date_end, date_from)
lateral view posexplode(split(space(datediff(to_date(t.date_end,'dd-MM-yyyy'), to_date(t.date_from,'dd-MM-yyyy'))), ' ')) pe as i, x
""")
result.show()
+---+----+------+----------+----------+
| ID|Name| add| date_from| date_end|
+---+----+------+----------+----------+
| 1| aaa|yyyyyy|20-01-2018|30-01-2018|
| 1| aaa|yyyyyy|21-01-2018|30-01-2018|
| 1| aaa|yyyyyy|22-01-2018|30-01-2018|
| 1| aaa|yyyyyy|23-01-2018|30-01-2018|
| 1| aaa|yyyyyy|24-01-2018|30-01-2018|
| 1| aaa|yyyyyy|25-01-2018|30-01-2018|
| 1| aaa|yyyyyy|26-01-2018|30-01-2018|
| 1| aaa|yyyyyy|27-01-2018|30-01-2018|
| 1| aaa|yyyyyy|28-01-2018|30-01-2018|
| 1| aaa|yyyyyy|29-01-2018|30-01-2018|
| 1| aaa|yyyyyy|30-01-2018|30-01-2018|
| 2| bbb|ffffff|02-11-2018|15-11-2018|
| 2| bbb|ffffff|03-11-2018|15-11-2018|
| 2| bbb|ffffff|04-11-2018|15-11-2018|
| 2| bbb|ffffff|05-11-2018|15-11-2018|
| 2| bbb|ffffff|06-11-2018|15-11-2018|
| 2| bbb|ffffff|07-11-2018|15-11-2018|
| 2| bbb|ffffff|08-11-2018|15-11-2018|
| 2| bbb|ffffff|09-11-2018|15-11-2018|
| 2| bbb|ffffff|10-11-2018|15-11-2018|
+---+----+------+----------+----------+
only showing top 20 rows
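On Spark 2.4+ the same daily expansion can also be done without the space()/split() trick, using the built-in sequence() function with the DataFrame API. A sketch against the df created above:

import pyspark.sql.functions as F

result2 = (df
    .withColumn("d_from", F.to_date("date_from", "dd-MM-yyyy"))
    .withColumn("d_end", F.to_date("date_end", "dd-MM-yyyy"))
    # one array element per day, then one row per element
    .withColumn("d", F.explode(F.expr("sequence(d_from, d_end, interval 1 day)")))
    .select("ID", "Name", "add",
            F.date_format("d", "dd-MM-yyyy").alias("date_from"),
            "date_end"))
result2.show(30)  # 30 so all 25 rows are displayed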
apache-spark pyspark apache-spark-sql
Upvotes: 4