PracticingPython

Reputation: 67

Issue writing pyspark or sql query to conditionally count column

I'm trying to write a query that creates and populates an ID column that counts up until it encounters a new non-null value in the Data column, at which point the count restarts at 1. Here's an example of the data:

+-------+
| Data  |
+-------+
|"this" |
| null  |
|"that" |
|"those"|
| null  |
| null  |
+-------+

And here's a sample of what the output should look like:

+-------+---+
| Data  | ID|
+-------+---+
|"this" | 1 |
| null  | 2 |
|"that" | 1 |
|"those"| 1 |
| null  | 2 |
| null  | 3 |
+-------+---+

I've attempted to use both row_number() and monotonically_increasing_id(), but neither produces the output that I need.
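For context, here's roughly the shape of what I tried (the window spec is illustrative, not my exact code). A global row_number() just numbers every row 1..n, so it never resets:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("this",), (None,), ("that",), ("those",), (None,), (None,)], ["Data"]
)

# A single global window: row_number() numbers rows 1..n across the
# whole DataFrame, so the count never restarts at a non-null value.
w = Window.orderBy(F.monotonically_increasing_id())
df.withColumn("ID", F.row_number().over(w)).show()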

Any ideas?

Thank you!

Upvotes: 0

Views: 64

Answers (2)

ValNik

Reputation: 5926

Here's a SQL example (MySQL syntax):

create table test(id int auto_increment, Data varchar(10),
  PRIMARY KEY (id));
insert into test (Data) values
 ("this")
,(null)
,("that")
,("those")
,(null)
,(null);
id  Data
1   this
2   null
3   that
4   those
5   null
6   null
select *
  ,row_number() over(partition by grn order by id) new_id
from(
  select *
    ,sum(case when Data is null then 0 else 1 end) over(order by id) grn
  from test
  )t
id  Data   grn  new_id
1   this   1    1
2   null   1    2
3   that   2    1
4   those  3    1
5   null   3    2
6   null   3    3

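The same cumulative-sum trick runs in Spark SQL too. A minimal sketch, with the caveat that Spark has no auto_increment, so an explicit id column is supplied here as an assumption to give the rows a deterministic order:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical setup: an explicit id column stands in for MySQL's
# auto_increment column used above.
rows = [(1, "this"), (2, None), (3, "that"), (4, "those"), (5, None), (6, None)]
spark.createDataFrame(rows, ["id", "Data"]).createOrReplaceTempView("test")

# Same query as above: a running sum of "is this row non-null?" defines
# the group (grn), and row_number() restarts at 1 inside each group.
spark.sql("""
    select Data,
           row_number() over (partition by grn order by id) as ID
    from (
        select *,
               sum(case when Data is null then 0 else 1 end)
                   over (order by id) as grn
        from test
    ) t
    order by id
""").show()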

Upvotes: 0

Lisan Al Gaib

Reputation: 419

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ("this",),
    (None,),
    ("that",),
    ("those",),
    (None,),
    (None,),
]

df = spark.createDataFrame(data, ["Data"])

# Incremental value to preserve the original row order.
df = df.withColumn("id_col", F.monotonically_increasing_id())

# Running window from the first row up to the current row.
window_spec = Window.orderBy("id_col").rowsBetween(Window.unboundedPreceding, 0)

# Cumulative count of non-null values: it bumps by 1 at each non-null row,
# so a non-null row and the nulls that follow it share the same value.
df = df.withColumn(
    "group_col",
    F.sum(F.when(F.col("Data").isNotNull(), 1).otherwise(0)).over(window_spec),
)

# Restart the count at 1 within each group.
df = df.withColumn("ID", F.row_number().over(Window.partitionBy("group_col").orderBy("id_col")))

df = df.drop("id_col", "group_col")

df.show()

id_col: an incremental value that preserves the original row ordering.

group_col: a cumulative sum over that window, via F.sum(), giving a group identifier that increases every time a non-null entry appears.

ID: row_number() partitioned by group_col, which restarts the count at 1 inside each group.

This gives output:

+-------+---+
| Data  | ID|
+-------+---+
| this  | 1 |
| null  | 2 |
| that  | 1 |
| those | 1 |
| null  | 2 |
| null  | 3 |
+-------+---+
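As a side note, an equivalent variant (not part of the original answer): since count() over a column ignores nulls, the sum-of-when expression can be collapsed into a single windowed count. A sketch reusing the imports and data list from the block above:

# count("Data") over the running window skips nulls, so it yields the
# same group identifier as F.sum(F.when(...)) above.
df = spark.createDataFrame(data, ["Data"]) \
    .withColumn("id_col", F.monotonically_increasing_id())

running = Window.orderBy("id_col").rowsBetween(Window.unboundedPreceding, 0)
df = df.withColumn("group_col", F.count("Data").over(running))
df = df.withColumn("ID", F.row_number().over(Window.partitionBy("group_col").orderBy("id_col")))
df.select("Data", "ID").show()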

Upvotes: 0
