sfarls

Reputation: 25

Fill rows with previous value when value = 0

I have a spark dataframe similar to the one below:

OLD DATAFRAME
+----+-------+
| ID | count |
+----+-------+
| 0  | 12000 |
| 1  | 12000 |
| 2  | 12000 |
| 3  | 12000 |
| 4  | 0     |
| 5  | 0     |
| 6  | 0     |
| 7  | 0     |
| 8  | 1400  |
| 9  | 1400  |
+----+-------+

I need to fill the 0 values with the most recent non-zero value in the column. Example:

TARGET DATAFRAME
+----+-------+
| ID | count |
+----+-------+
| 0  | 12000 |
| 1  | 12000 |
| 2  | 12000 |
| 3  | 12000 |
| 4  | 12000 |
| 5  | 12000 |
| 6  | 12000 |
| 7  | 12000 |
| 8  | 1400  |
| 9  | 1400  |
+----+-------+

How do I go about doing this?
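For reference, a minimal sketch to reproduce the input (the names spark and df just follow the answers below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample input from the table above
df = spark.createDataFrame(
    [(0, 12000), (1, 12000), (2, 12000), (3, 12000),
     (4, 0), (5, 0), (6, 0), (7, 0), (8, 1400), (9, 1400)],
    ['ID', 'count'],
)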

Upvotes: 0

Views: 238

Answers (2)

ZygD

Reputation: 24356

This is the easiest way I can think of.

from pyspark.sql import functions as F, Window as W

df = df.withColumn(
    'count',
    # when() has no otherwise(), so zeros become null; last(..., True)
    # (ignorenulls=True) then returns the most recent non-null value in
    # the window frame, which defaults to unboundedPreceding..currentRow
    # when an orderBy is given.
    F.last(F.when(F.col('count') != 0, F.col('count')), True).over(W.orderBy('ID'))
)
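
Applied to the example above, this yields the target dataframe. One caveat: W.orderBy('ID') without a partitionBy moves all rows to a single partition (Spark logs a warning about this), which is fine for small data like this but won't scale to large inputs.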

Upvotes: 1

Anjaneya Tripathi

Reputation: 1459

By using last we can achieve the required result:

from io import StringIO
import pandas as pd
from pyspark.sql import functions as F, Window

s2 = StringIO("""
ID|count
0|13000
1|11000
2|14000
3|12000
4|0
5|0
6|0
7|0
8|1400
9|1400""")

dfp2 = pd.read_csv(s2, sep='|')
dfs2 = spark.createDataFrame(dfp2)

dfs2.withColumn(
    # step 1: turn the zeros into nulls
    'count', F.when(F.col('count') == 0, None).otherwise(F.col('count'))
).withColumn(
    # step 2: where the value is null, take the last non-null value ordered by ID
    'count',
    F.when(F.col('count').isNull(),
           F.last(F.col('count'), True).over(Window.orderBy('ID')))
    .otherwise(F.col('count'))
).show()

#output
+---+-----+
| ID|count|
+---+-----+
|  0|13000|
|  1|11000|
|  2|14000|
|  3|12000|
|  4|12000|
|  5|12000|
|  6|12000|
|  7|12000|
|  8| 1400|
|  9| 1400|
+---+-----+

I changed some values in the input to show that only the most recent non-zero value is copied forward (see the output above).

First I changed the 0s to nulls; then, wherever there was a null value in the column, I picked the last non-null value, ordered by ID.

Upvotes: 0
