Reputation: 113
I have a DataFrame (df) like this:
col1 | col2 | col3 |
---|---|---|
One | Two | x |
One | Two | full |
One | Two | y |
One | Two | z |
One | Two | full |
One | Two | u |
One | Two | e |
Using PySpark, I want to mark the row immediately after each row where col3 == "full" with 1, and every other row with 0, like this:
col1 | col2 | col3 | flag |
---|---|---|---|
One | Two | x | 0 |
One | Two | full | 0 |
One | Two | y | 1 |
One | Two | z | 0 |
One | Two | full | 0 |
One | Two | u | 1 |
One | Two | e | 0 |
At the moment this is my idea, but it flags the "full" rows themselves instead of the rows immediately after them:
df.withColumn('flag',f.when(f.col('CD_OPERAZIONE')=='full',1).otherwise(0))
Can you help me?
Upvotes: 0
Views: 70
Reputation: 26676
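Use lag over a window together with a when expression. Note that the window should be ordered by a column that reflects the real row order; col1 is constant within each partition here, so ordering by it does not guarantee a stable order.
from pyspark.sql import Window
from pyspark.sql.functions import lag, when
w = Window.partitionBy('col1', 'col2').orderBy('col1')
df.withColumn('x', when(lag('col3').over(w) == 'full', 1).otherwise(0)).show()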
+----+----+----+---+
|col1|col2|col3|  x|
+----+----+----+---+
| One| Two|   x|  0|
| One| Two|full|  0|
| One| Two|   y|  1|
| One| Two|   z|  0|
| One| Two|full|  0|
| One| Two|   u|  1|
| One| Two|   e|  0|
+----+----+----+---+
Upvotes: 1
Reputation: 110
Step 1: Assign a row number to each row using the row_number function.
Step 2: Filter the DataFrame on col3 == "full". This gives you the row numbers where col3 is "full"; call the result dataframe2.
Step 3: In dataframe2, create a new column by adding one to the row number. This column holds the row numbers of the rows immediately after the "full" rows.
Step 4: Join the original DataFrame with dataframe2, matching the original row number against the new column from dataframe2. An inner join returns only the rows that follow a "full" row; a left join keeps every row, so the unmatched ones can be flagged 0.
Pardon the lack of code, I'm on my mobile. Let me know if you still want help.
Upvotes: 0