Reputation: 67
Trying to write a query that correctly creates and populates an ID column that counts until it encounters a new non-null value in the Data column, at which point it restarts at 1. Here's an example of the data.
+-------+
| Data |
+-------+
|"this" |
| null |
|"that" |
|"those"|
| null |
| null |
And here's a sample of what the output should look like:
+-------+---+
| Data | ID|
+-------+---+
|"this" | 1 |
| null | 2 |
|"that" | 1 |
|"those"| 1 |
| null | 2 |
| null | 3 |
I've attempted to use both row_number() and Monotonically_increasing_id() but neither are producing the output that I need.
Any ideas?
Thank you!
Upvotes: 0
Views: 64
Reputation: 5926
See SQL example
create table test(id int auto_increment,Data varchar(10),
PRIMARY KEY (id));
insert into test (data)values
("this")
,(null )
,("that" )
,("those")
,(null )
,(null) ;
id | Data |
---|---|
1 | this |
2 | null |
3 | that |
4 | those |
5 | null |
6 | null |
select *
,row_number()over(partition by grn order by id)new_id
from(
select *
,sum(case when data is null then 0 else 1 end)over(order by id) grn
from test
)t
id | Data | grn | new_id |
---|---|---|---|
1 | this | 1 | 1 |
2 | null | 1 | 2 |
3 | that | 2 | 1 |
4 | those | 3 | 1 |
5 | null | 3 | 2 |
6 | null | 3 | 3 |
Upvotes: 0
Reputation: 419
from pyspark.sql import Window
import pyspark.sql.functions as F
data = [
("this",),
(None,),
("that",),
("those",),
(None,),
(None,),
]
df = spark.createDataFrame(data, ["Data"])
window_spec = Window.orderBy("id_col").rowsBetween(Window.unboundedPreceding, 0)
df = df.withColumn("id_col", F.monotonically_increasing_id())
df = df.withColumn("group_col", F.sum(F.when(F.col("Data").isNotNull(), 1).otherwise(0)).over(window_spec))
df = df.withColumn("ID", F.row_number().over(Window.partitionBy("group_col").orderBy("id_col")))
df = df.drop("id_col", "group_col")
df.show()
id_col: To add an incremental value, which will help us maintain the row ordering
group_col: Within this window, apply a cumulative sum using F.sum()
so that we can get an identification that increases every time it hits a non null entry
This gives output:
+-------+---+
| Data | ID|
+-------+---+
| this | 1 |
| null | 2 |
| that | 1 |
| those | 1 |
| null | 2 |
| null | 3 |
+-------+---+
Upvotes: 0