Reputation: 11

create a new column to increment value when value resets to 1 in another column in pyspark

Logic and columnIn Pyspark DataFrame consider a column like [1,2,3,4,1,2,1,1,2,3,1,2,1,1,2]. Pyspark Column create a new column to increment value when value resets to 1. Expected output is[1,1,1,1,2,2,3,4,4,4,5,5,6,7,7]

i am bit new to pyspark, if anyone can help me it would be great for me.

written the logic as like below

def sequence(row_num):
    results = [1, ]
    flag = 1
    for col in range(0, len(row_num)-1):
        if row_num[col][0]>=row_num[col+1][0]:
            flag+=1
        results.append(flag)
    return results

but not able to pass a column through udf. please help me in this

Upvotes: 1

Answers (1)

Luiz Viola

Reputation: 2436

Your Dataframe:

df  = spark.createDataFrame(
  [
('1','a'),
('2','b'),
('3','c'),
('4','d'),
('1','e'),
('2','f'),
('1','g'),
('1','h'),
('2','i'),
('3','j'),
('1','k'),
('2','l'),
('1','m'),
('1','n'),
('2','o')
  ], ['group','label']
)

+-----+-----+
|group|label|
+-----+-----+
|    1|    a|
|    2|    b|
|    3|    c|
|    4|    d|
|    1|    e|
|    2|    f|
|    1|    g|
|    1|    h|
|    2|    i|
|    3|    j|
|    1|    k|
|    2|    l|
|    1|    m|
|    1|    n|
|    2|    o|
+-----+-----+

You can create a flag and use a Window Function to calculate the cumulative sum. No need to use an UDF:

from pyspark.sql import Window as W
from pyspark.sql import functions as F

w = W.partitionBy().orderBy('label').rowsBetween(Window.unboundedPreceding, 0)

df\
    .withColumn('Flag', F.when(F.col('group') == 1, 1).otherwise(0))\
    .withColumn('Output', F.sum('Flag').over(w))\
    .show()

+-----+-----+----+------+
|group|label|Flag|Output|
+-----+-----+----+------+
|    1|    a|   1|     1|
|    2|    b|   0|     1|
|    3|    c|   0|     1|
|    4|    d|   0|     1|
|    1|    e|   1|     2|
|    2|    f|   0|     2|
|    1|    g|   1|     3|
|    1|    h|   1|     4|
|    2|    i|   0|     4|
|    3|    j|   0|     4|
|    1|    k|   1|     5|
|    2|    l|   0|     5|
|    1|    m|   1|     6|
|    1|    n|   1|     7|
|    2|    o|   0|     7|
+-----+-----+----+------+

Upvotes: 1

create a new column to increment value when value resets to 1 in another column in pyspark

Answers (1)

Related Questions