Reputation: 25
I am looking into generating two groups under the same ID, based on timestamp order.
The input is like this:
The expected output is:
I tried dense_rank(), but the resulting groups did not follow the timestamp order. Any help is appreciated!
Upvotes: 0
Views: 58
Reputation: 5032
A quick solution is to use the timestamp value: compute the min, max, and median per ID, and use the median as the dividing point to split the rows into two groups.
If you want to divide the rows into two roughly equal halves based on timestamp:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

input_list = [
    (1, None, 111),
    (1, None, 120),
    (1, None, 121),
    (1, None, 124),
    (1, 'p1', 125),
    (1, None, 126),
    (1, None, 146),
    (1, None, 147),
]

sparkDF = spark.createDataFrame(input_list, ['id', 'p_id', 'timestamp'])

# Per-ID window for the aggregate columns
grp_window = Window.partitionBy('id')

# Approximate median of timestamp within each ID
median_percentile = F.expr('percentile_approx(timestamp, 0.5)')

sparkDF = sparkDF.withColumn('min_timestamp', F.min(F.col('timestamp')).over(grp_window))\
                 .withColumn('median_value', median_percentile.over(grp_window))\
                 .withColumn('max_timestamp', F.max(F.col('timestamp')).over(grp_window))

# Rows up to the median become 'p1', the rest 'p2'; the + 1 widens the
# boundary so the row already tagged 'p1' at timestamp 125 stays in the first group
cond = (
    (F.col('min_timestamp') <= F.col('timestamp'))
    & (F.col('timestamp') <= F.col('median_value') + 1)
)

sparkDF = sparkDF.withColumn('p_id', F.when(cond, 'p1').otherwise('p2'))

sparkDF.show()
+---+----+---------+-------------+------------+-------------+
| id|p_id|timestamp|min_timestamp|median_value|max_timestamp|
+---+----+---------+-------------+------------+-------------+
| 1| p1| 111| 111| 124| 147|
| 1| p1| 120| 111| 124| 147|
| 1| p1| 121| 111| 124| 147|
| 1| p1| 124| 111| 124| 147|
| 1| p1| 125| 111| 124| 147|
| 1| p2| 126| 111| 124| 147|
| 1| p2| 146| 111| 124| 147|
| 1| p2| 147| 111| 124| 147|
+---+----+---------+-------------+------------+-------------+
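If all you need is a strict 50/50 split by timestamp, ntile can do the bucketing directly over an ordered window. A minimal sketch, assuming the same sparkDF as above (the name sparkDF_ntile is mine); note it produces an exact 4/4 split here rather than the median + 1 boundary used above, so timestamp 125 would land in 'p2':

# Alternative: ntile(2) buckets the rows ordered by timestamp into two
# equal-sized groups per id
ordered_window = Window.partitionBy('id').orderBy('timestamp')

sparkDF_ntile = sparkDF.withColumn(
    'p_id',
    F.concat(F.lit('p'), F.ntile(2).over(ordered_window).cast('string'))
)

sparkDF_ntile.show()

This avoids computing the min/median/max columns, at the cost of losing control over exactly where the boundary falls.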
Upvotes: 1