Reputation: 303
I have a LARGE, sorted PySpark dataframe "df", that I need to iterate through and do the following for each row (by index):
If "row['col1'] == nextrow['col1']:
If nextrow['col3'] == 1:
thisrow['col4'] == 1
For example given:
# +---+----+----+----+
# |id |col1|col3|col4|
# +---+----+----+----+
# |1 |33 |1 |0 |
# |2 |33 |0 |0 |
# |3 |33 |0 |0 |
# |4 |11 |1 |0 |
# |5 |11 |1 |0 |
# |6 |22 |0 |0 |
# |7 |22 |1 |0 |
# +---+----+----+----+
Would generate:
# +---+----+----+----+
# |id |col1|col3|col4|
# +---+----+----+----+
# |1 |33 |1 |0 |
# |2 |33 |0 |0 |
# |3 |33 |0 |0 |
# |4 |11 |1 |1 |
# |5 |11 |1 |0 |
# |6 |22 |0 |1 |
# |7 |22 |1 |0 |
# +---+----+----+----+
I know Spark dataframes are immutable. What is the best way to do this? I've thought about converting it to an RDD and creating a function for a map+lambda combo, but I do not know how to determine which row I am on without adding an index column.
Upvotes: 0
Views: 53
Reputation: 669
have a try of this method:
d1 = [
(1, 33, 1, 9),
(2, 33, 1, 9),
(3, 22, 1, 9),
(4, 11, 1, 9),
(5, 11, 1, 9),
(6, 22, 1, 9),
(7, 33, 1, 9),
]
df1 = spark.createDataFrame(d1, ['id', 'col1', 'col3', 'col4'])
df1.show(10, False)
# +---+----+----+----+
# |id |col1|col3|col4|
# +---+----+----+----+
# |1 |33 |1 |9 |
# |2 |33 |1 |9 |
# |3 |22 |1 |9 |
# |4 |11 |1 |9 |
# |5 |11 |1 |9 |
# |6 |22 |1 |9 |
# |7 |33 |1 |9 |
# +---+----+----+----+
df1\
.withColumn('next_col1', lead(col('col1')).over(Window.partitionBy().orderBy('id')))\
.withColumn('next_col3', lead(col('col3')).over(Window.partitionBy().orderBy('id')))\
.withColumn('new_col4', when((col('col1')==col('next_col1')) & (col('next_col3')==1), lit(1)).otherwise(col('col4')))\
.show(10, False)
# +---+----+----+----+---------+---------+--------+
# |id |col1|col3|col4|next_col1|next_col3|new_col4|
# +---+----+----+----+---------+---------+--------+
# |1 |33 |1 |9 |33 |1 |1 |
# |2 |33 |1 |9 |22 |1 |9 |
# |3 |22 |1 |9 |11 |1 |9 |
# |4 |11 |1 |9 |11 |1 |1 |
# |5 |11 |1 |9 |22 |1 |9 |
# |6 |22 |1 |9 |33 |1 |9 |
# |7 |33 |1 |9 |null |null |9 |
# +---+----+----+----+---------+---------+--------+
Upvotes: 1