Reputation: 213
I have a table and after the first number shows up, the number needs to be repeated fill in all the column after
here is a table example:
student_id 2016Q1 2016Q2 2016Q3 2016Q4 2017Q1 2017Q2 2017Q3
1 1
2 1
3 1
4 1
5 1
6 1
7 1
I am using pyspark and Python. I try to do it manually in excel, but that is too much manual editing. I wonder if there any function in pyspark or pyspark.sql are able to generate the following table. Thanks!
I expect the result to look like the following table
student_id 2016Q1 2016Q2 2016Q3 2016Q4 2017Q1 2017Q2 2017Q3
1 1 1 1 1 1 1
2 1 1 1 1 1 1 1
3 1 1 1 1 1
4 1 1 1
5 1 1 1
6 1 1
7 1 1 1
Upvotes: 2
Views: 241
Reputation: 43494
Assuming that those blanks are actually null
s*, you can use pyspark.sql.functions.coalesce
.
Enumerate your columns and coalesce
the values of every column before and including the current column. This will return the first non-null value in that row.
from pyspark.sql.functions import coalesce
fill_cols = df.columns[1:] # exclude the student_id
df.select(
"student_id",
*[coalesce(*fill_cols[:i+1]).alias(c) for i, c in enumerate(fill_cols)]
).show()
#+----------+------+------+------+------+------+------+------+
#|student_id|2016Q1|2016Q2|2016Q3|2016Q4|2017Q1|2017Q2|2017Q3|
#+----------+------+------+------+------+------+------+------+
#| 1| null| 1| 1| 1| 1| 1| 1|
#| 2| 1| 1| 1| 1| 1| 1| 1|
#| 3| null| null| 1| 1| 1| 1| 1|
#| 4| null| null| null| null| 1| 1| 1|
#| 5| null| null| null| null| 1| 1| 1|
#| 6| null| null| null| null| null| 1| 1|
#| 7| null| null| null| null| 1| 1| 1|
#+----------+------+------+------+------+------+------+------+
*If not, first replace blanks with null.
Upvotes: 1