ShaneK
ShaneK

Reputation: 303

Iterating PySpark Dataframe to Populate a Column

I have a LARGE, sorted PySpark dataframe "df", that I need to iterate through and do the following for each row (by index):

If "row['col1'] == nextrow['col1']:
    If nextrow['col3'] == 1:
        thisrow['col4'] == 1

For example given:

# +---+----+----+----+
# |id |col1|col3|col4|
# +---+----+----+----+
# |1  |33  |1   |0   |
# |2  |33  |0   |0   |
# |3  |33  |0   |0   |
# |4  |11  |1   |0   |
# |5  |11  |1   |0   |
# |6  |22  |0   |0   |
# |7  |22  |1   |0   |
# +---+----+----+----+

Would generate:

# +---+----+----+----+
# |id |col1|col3|col4|
# +---+----+----+----+
# |1  |33  |1   |0   |
# |2  |33  |0   |0   |
# |3  |33  |0   |0   |
# |4  |11  |1   |1   |
# |5  |11  |1   |0   |
# |6  |22  |0   |1   |
# |7  |22  |1   |0   |
# +---+----+----+----+

I know Spark dataframes are immutable. What is the best way to do this? I've thought about converting it to an RDD and creating a function for a map+lambda combo, but I do not know how to determine which row I am on without adding an index column.

Upvotes: 0

Views: 53

Answers (1)

Linus
Linus

Reputation: 669

have a try of this method:

d1 = [
    (1, 33, 1, 9),
    (2, 33, 1, 9),
    (3, 22, 1, 9),
    (4, 11, 1, 9),
    (5, 11, 1, 9),
    (6, 22, 1, 9),
    (7, 33, 1, 9),
]

df1 = spark.createDataFrame(d1, ['id', 'col1', 'col3', 'col4'])
df1.show(10, False)
# +---+----+----+----+
# |id |col1|col3|col4|
# +---+----+----+----+
# |1  |33  |1   |9   |
# |2  |33  |1   |9   |
# |3  |22  |1   |9   |
# |4  |11  |1   |9   |
# |5  |11  |1   |9   |
# |6  |22  |1   |9   |
# |7  |33  |1   |9   |
# +---+----+----+----+
df1\
    .withColumn('next_col1', lead(col('col1')).over(Window.partitionBy().orderBy('id')))\
    .withColumn('next_col3', lead(col('col3')).over(Window.partitionBy().orderBy('id')))\
    .withColumn('new_col4', when((col('col1')==col('next_col1')) & (col('next_col3')==1), lit(1)).otherwise(col('col4')))\
.show(10, False)
# +---+----+----+----+---------+---------+--------+
# |id |col1|col3|col4|next_col1|next_col3|new_col4|
# +---+----+----+----+---------+---------+--------+
# |1  |33  |1   |9   |33       |1        |1       |
# |2  |33  |1   |9   |22       |1        |9       |
# |3  |22  |1   |9   |11       |1        |9       |
# |4  |11  |1   |9   |11       |1        |1       |
# |5  |11  |1   |9   |22       |1        |9       |
# |6  |22  |1   |9   |33       |1        |9       |
# |7  |33  |1   |9   |null     |null     |9       |
# +---+----+----+----+---------+---------+--------+

Upvotes: 1

Related Questions