Reputation: 25
I am looking into generating two groups under the same ID, based on timestamp order.
The input is like this:
The expected output is:
I tried dense_rank(), but the resulting groups did not follow the timestamp order. Any help is appreciated!
Upvotes: 0
Views: 58
Reputation: 5032
A quick solution is to use the timestamp value: compute the min, max, and median per ID, and use the median as the dividing point to split the rows into two groups.
If you want to divide the rows into two roughly equal halves based on timestamp:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

input_list = [
    (1, None, 111),
    (1, None, 120),
    (1, None, 121),
    (1, None, 124),
    (1, 'p1', 125),
    (1, None, 126),
    (1, None, 146),
    (1, None, 147),
]

sparkDF = spark.createDataFrame(input_list, ['id', 'p_id', 'timestamp'])

# Per-ID window for the aggregate columns
grp_window = Window.partitionBy('id')

# Approximate median of timestamp within each ID
median_percentile = F.expr('percentile_approx(timestamp, 0.5)')

sparkDF = sparkDF.withColumn('min_timestamp', F.min(F.col('timestamp')).over(grp_window))\
                 .withColumn('median_value', median_percentile.over(grp_window))\
                 .withColumn('max_timestamp', F.max(F.col('timestamp')).over(grp_window))

# Rows up to the median become 'p1', the rest 'p2'; the + 1 widens the
# boundary so the row already tagged 'p1' at timestamp 125 stays in the first group
cond = (
    (F.col('min_timestamp') <= F.col('timestamp'))
    & (F.col('timestamp') <= F.col('median_value') + 1)
)

sparkDF = sparkDF.withColumn('p_id', F.when(cond, 'p1').otherwise('p2'))

sparkDF.show()
+---+----+---------+-------------+------------+-------------+
| id|p_id|timestamp|min_timestamp|median_value|max_timestamp|
+---+----+---------+-------------+------------+-------------+
| 1| p1| 111| 111| 124| 147|
| 1| p1| 120| 111| 124| 147|
| 1| p1| 121| 111| 124| 147|
| 1| p1| 124| 111| 124| 147|
| 1| p1| 125| 111| 124| 147|
| 1| p2| 126| 111| 124| 147|
| 1| p2| 146| 111| 124| 147|
| 1| p2| 147| 111| 124| 147|
+---+----+---------+-------------+------------+-------------+
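If all you need is a strict 50/50 split by timestamp, ntile can do the bucketing directly over an ordered window. A minimal sketch, assuming the same sparkDF as above (the name sparkDF_ntile is mine); note it produces an exact 4/4 split here rather than the median + 1 boundary used above, so timestamp 125 would land in 'p2':

# Alternative: ntile(2) buckets the rows ordered by timestamp into two
# equal-sized groups per id
ordered_window = Window.partitionBy('id').orderBy('timestamp')

sparkDF_ntile = sparkDF.withColumn(
    'p_id',
    F.concat(F.lit('p'), F.ntile(2).over(ordered_window).cast('string'))
)

sparkDF_ntile.show()

This avoids computing the min/median/max columns, at the cost of losing control over exactly where the boundary falls.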
Upvotes: 1