Reputation: 1245
I have the following PySpark dataframe:
id1 id2 date col1 col2
1 1 2022-W01 5 10
1 2 2022-W02 2 5
1 3 2022-W03 3 8
1 5 2022-W05 5 3
2 2 2022-W03 2 2
2 6 2022-W05 4 1
2 8 2022-W07 3 2
I would like to fill the missing dates for each id1 and obtain something like this:
id1 id2 date col1 col2
1 1 2022-W01 5 10
1 2 2022-W02 2 5
1 3 2022-W03 3 8
1 NA 2022-W04 NA NA
1 5 2022-W05 5 3
2 2 2022-W03 2 2
2 NA 2022-W04 NA NA
2 6 2022-W05 4 1
2 NA 2022-W06 NA NA
2 8 2022-W07 3 2
I started with this code:
df.groupby('id1').agg(F.expr('max(date)').alias('max_date'),
                      F.expr('min(date)').alias('min_date'))\
  .withColumn('date', F.expr("explode(sequence(min_date, max_date, interval 1 week))"))\
  .drop('max_date', 'min_date')
The main problem is that my date column is in a particular format ('2022-W01'), so sequence cannot work with it directly. I couldn't find a quick solution for this.
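For context, 'YYYY-Www' is an ISO week string, and in plain Python it can be converted to and from a real date (which could back a UDF or a join spine). A minimal sketch, not PySpark-specific:

```python
from datetime import date

def week_str_to_date(s):
    """Convert an ISO week string like '2022-W01' to the Monday of that week."""
    year, week = s.split("-W")
    return date.fromisocalendar(int(year), int(week), 1)

def date_to_week_str(d):
    """Format a date back into the 'YYYY-Www' form used in the dataframe."""
    iso = d.isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"

print(week_str_to_date("2022-W01"))            # 2022-01-03
print(date_to_week_str(date(2022, 1, 3)))      # 2022-W01
```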
Upvotes: 0
Views: 173
Reputation: 26686
from pyspark.sql import Window
from pyspark.sql.functions import (array_except, col, collect_list, concat,
                                   explode, lpad, max, min, regexp_extract,
                                   regexp_replace, sequence)

# week number at the end of the date string, e.g. '01' from '2022-W01'
r = regexp_extract('date', r'\d+$', 0)
w = Window.partitionBy('id1')
new = (
    df.withColumn('y', min(r).over(w).astype('int'))
      .withColumn('x', max(r).over(w).astype('int'))
      # build the full week-number range per id1, then use array_except
      # to keep only the week numbers that are missing
      .withColumn('z', array_except(sequence(col('y'), col('x')),
                                    collect_list(r.astype('int')).over(w)))
      # one row per missing week number
      .withColumn('z', explode('z'))
      # rebuild the date string; lpad restores the zero-padding ('4' -> '04')
      .withColumn('date', concat(regexp_replace('date', r'\d+$', ''),
                                 lpad(col('z'), 2, '0')))
      # select required columns
      .select('id1', 'date')
      # drop duplicates
      .dropDuplicates()
)

# append the generated gap rows to the existing df and sort
df.unionByName(new, allowMissingColumns=True).sort('id1', 'date').show()
+---+----+--------+----+----+
|id1| id2| date|col1|col2|
+---+----+--------+----+----+
| 1| 1|2022-W01| 5| 10|
| 1| 2|2022-W02| 2| 5|
| 1| 3|2022-W03| 3| 8|
| 1|null|2022-W04|null|null|
| 1| 5|2022-W05| 5| 3|
| 2| 2|2022-W03| 2| 2|
| 2|null|2022-W04|null|null|
| 2| 6|2022-W05| 4| 1|
| 2|null|2022-W06|null|null|
| 2| 8|2022-W07| 3| 2|
+---+----+--------+----+----+
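Note that the week-number arithmetic above assumes all rows for an id1 fall in the same year. If the data can cross a year boundary (e.g. '2021-W52' followed by '2022-W02'), the full week range has to be generated date-aware. A minimal plain-Python sketch of that logic, which could drive a UDF or a pre-generated spine dataframe:

```python
from datetime import date, timedelta

def fill_missing_weeks(weeks):
    """Given ISO week strings like '2021-W52', return the full consecutive
    range of week strings from the earliest to the latest week, inclusive."""
    def to_date(s):
        year, wk = s.split("-W")
        # Monday of that ISO week
        return date.fromisocalendar(int(year), int(wk), 1)

    start, end = min(map(to_date, weeks)), max(map(to_date, weeks))
    out = []
    d = start
    while d <= end:
        iso = d.isocalendar()
        out.append(f"{iso[0]}-W{iso[1]:02d}")
        d += timedelta(weeks=1)
    return out

print(fill_missing_weeks(["2021-W52", "2022-W02"]))
# ['2021-W52', '2022-W01', '2022-W02']
```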
Upvotes: 1