Reputation: 105
I have the following df:
index  initial_range  final_range
1      1000000        5999999
2      6000000        6299999
3      6300000        6399999
4      6400000        6499999
5      6600000        6699999
6      6700000        6749999
7      6750000        6799999
8      7000000        7399999
9      7600000        7699999
10     7700000        7749999
11     7750000        7799999
12     6500000        6549999
Note that the 'initial_range' and 'final_range' fields are coverage intervals. Comparing row index 1 with row index 2, the 'initial_range' of index 2 continues the sequence right after the 'final_range' of index 1 (final_range + 1): in the example, index 1 ends at 5999999 and index 2 starts at 6000000. I need to group these cases and return the following df:
index  initial_range  final_range  grouping
1      1000000        5999999      1000000-6549999
2      6000000        6299999      1000000-6549999
3      6300000        6399999      1000000-6549999
4      6400000        6499999      1000000-6549999
5      6600000        6699999      6600000-6799999
6      6700000        6749999      6600000-6799999
7      6750000        6799999      6600000-6799999
8      7000000        7399999      7000000-7399999
9      7600000        7699999      7600000-7799999
10     7700000        7749999      7600000-7799999
11     7750000        7799999      7600000-7799999
12     6500000        6549999      1000000-6549999
Note that the 'grouping' field contains the new coverage ranges, built as min(initial_range) and max(final_range) of each chain of rows, until the sequence is broken.
Some details:
I tried this code:
comparison = df == df.shift()+1
df['grouping'] = comparison['initial_range'] & comparison['final_range']
But the sequence logic didn't work.
Can anyone help me?
Upvotes: 0
Views: 83
Reputation: 758
Well, this was a tough one; here is my answer.
First of all, I am using a UDF, so expect the performance to be a bit poor:
import copy
import pyspark.sql.functions as F
from pyspark.sql.types import *

# Running group number shared across UDF calls
rn = 0

def check_vals(x, y):
    global rn
    if (y is not None) and (int(x) + 1) == int(y):
        # The next row continues the sequence, so keep the current group number
        return rn + 1
    else:
        # Copy the current group number before incrementing it
        res = copy.copy(rn)
        # Increment so that the next group starts from +1
        rn += 1
        # Return the old value, as we want to group on it
        return res + 1

rn_udf = F.udf(lambda x, y: check_vals(x, y), IntegerType())
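For reference, here is a minimal sketch of how the input df could be created from the question's data (the ID, initial_range and final_range column names are assumed from the output below; this setup is not part of the original answer):
# Hypothetical setup: recreate the sample data from the question with an ID column
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [
    (1, 1000000, 5999999), (2, 6000000, 6299999), (3, 6300000, 6399999),
    (4, 6400000, 6499999), (5, 6600000, 6699999), (6, 6700000, 6749999),
    (7, 6750000, 6799999), (8, 7000000, 7399999), (9, 7600000, 7699999),
    (10, 7700000, 7749999), (11, 7750000, 7799999), (12, 6500000, 6549999),
]
df = spark.createDataFrame(data, ["ID", "initial_range", "final_range"])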
Next,
from pyspark.sql.window import Window

# We want to check the final_range values against the next initial_range
w = Window.orderBy(F.col('initial_range'))

# First, take the next row's initial_range into a column called nextRange so that we can compare
# Check if final_range + 1 == nextRange; if yes, use the rn value, if not, use rn and increment it for the next iteration
# Then find the min and max values in the partition created by the check_1 column
# Concat the min and max values
# Order by ID to get the initial ordering; I had to cast it to integer, but you might not need it
# Drop all the intermediate columns
df.withColumn('nextRange', F.lead('initial_range').over(w)) \
.withColumn('check_1', rn_udf("final_range", "nextRange")) \
.withColumn('min_val', F.min("initial_range").over(Window.partitionBy("check_1"))) \
.withColumn('max_val', F.max("final_range").over(Window.partitionBy("check_1"))) \
.withColumn('range', F.concat("min_val", F.lit("-"), "max_val")) \
.orderBy(F.col("ID").cast(IntegerType())) \
.drop("nextRange", "check_1", "min_val", "max_val") \
.show(truncate=False)
Output:
+---+-------------+-----------+---------------+
|ID |initial_range|final_range|range |
+---+-------------+-----------+---------------+
|1 |1000000 |5999999 |1000000-6549999|
|2 |6000000 |6299999 |1000000-6549999|
|3 |6300000 |6399999 |1000000-6549999|
|4 |6400000 |6499999 |1000000-6549999|
|5 |6600000 |6699999 |6600000-6799999|
|6 |6700000 |6749999 |6600000-6799999|
|7 |6750000 |6799999 |6600000-6799999|
|8 |7000000 |7399999 |7000000-7399999|
|9 |7600000 |7699999 |7600000-7799999|
|10 |7700000 |7749999 |7600000-7799999|
|11 |7750000 |7799999 |7600000-7799999|
|12 |6500000 |6549999 |1000000-6549999|
+---+-------------+-----------+---------------+
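As a side note, if you want to avoid the stateful UDF entirely, a rough window-only sketch of the same grouping (a lag + running-sum "gaps and islands" variant, assuming the same df and column names as above; this is not the approach used in this answer) could look like:
# UDF-free sketch: flag rows whose initial_range does not continue the previous final_range,
# turn the running sum of the flags into a group id, then build the range label per group
w_ord = Window.orderBy('initial_range')
w_cum = w_ord.rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn('prev_final', F.lag('final_range').over(w_ord)) \
  .withColumn('is_break', F.when(F.col('initial_range') == F.col('prev_final') + 1, 0).otherwise(1)) \
  .withColumn('grp', F.sum('is_break').over(w_cum)) \
  .withColumn('range', F.concat(F.min('initial_range').over(Window.partitionBy('grp')),
                                F.lit('-'),
                                F.max('final_range').over(Window.partitionBy('grp')))) \
  .orderBy(F.col('ID').cast(IntegerType())) \
  .drop('prev_final', 'is_break', 'grp') \
  .show(truncate=False)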
Upvotes: 1