Alfonso

Reputation: 113

PySpark: get the row immediately after the one selected

I have a DataFrame (df) like this:

col1 col2 col3
One Two x
One Two full
One Two y
One Two z
One Two full
One Two u
One Two e

Using PySpark, I want to mark the row immediately after each row where col3 == "full" with 1, and every other row with 0, like this:

col1 col2 col3 flag
One Two x 0
One Two full 0
One Two y 1
One Two z 0
One Two full 0
One Two u 1
One Two e 0

This is my attempt so far, but it flags the "full" rows themselves rather than the rows immediately after them:

import pyspark.sql.functions as f
df.withColumn('flag', f.when(f.col('col3') == 'full', 1).otherwise(0))

Can you help me?

Upvotes: 0

Views: 70

Answers (2)

wwnde

Reputation: 26676

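Use lag over a window together with when. Note that lag depends on the window's ordering: every row in a partition here has the same col1, so orderBy('col1') does not pin down the row order. The result below is only reliable if you order by a column that actually defines the sequence (an index or a timestamp, for example).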

from pyspark.sql import Window
from pyspark.sql.functions import lag, when
w = Window.partitionBy('col1', 'col2').orderBy('col1')
df.withColumn('x', when(lag('col3').over(w) == 'full', 1).otherwise(0)).show()

+----+----+----+---+
|col1|col2|col3|  x|
+----+----+----+---+
| One| Two|   x|  0|
| One| Two|full|  0|
| One| Two|   y|  1|
| One| Two|   z|  0|
| One| Two|full|  0|
| One| Two|   u|  1|
| One| Two|   e|  0|
+----+----+----+---+

Upvotes: 1

Akash

Reputation: 110

Step 1: Assign a row number to each row using the row_number function.
Step 2: Filter the dataframe on col3 == "full". This gives you the row numbers where col3 is full; call the result dataframe2.
Step 3: In dataframe2, create a new column by adding one to the row number. You now have the row numbers of the rows immediately after the ones where col3 is full.
Step 4: Join dataframe1 to dataframe2 on dataframe1's row number and dataframe2's new column, selecting only the new column from dataframe2. An inner join returns just the rows that follow a "full" row; use a left join instead if you want to keep every row and set the flag to 0 where there is no match.

Sorry for the lack of code; I'm on mobile. Let me know if you still want help.

Upvotes: 0
