Reputation: 2431
My input DataFrame looks like this:
+-------+
|Column1|
+-------+
|FILE_1 |
|11 |
|12 |
|212 |
|FILE_2 |
|1213 |
|13312 |
|4124 |
+-------+
In the input DataFrame, FILE_* is a file name and the rows below it are that file's contents.
For example, FILE_1 has the contents 11, 12, and 212.
I want to add a new column to the existing input DataFrame that captures this relation between a file name and its contents, like below:
+-------+-------+
|Column1|Column2|
+-------+-------+
|FILE_1 |FILE_1 |
|11 |FILE_1 |
|12 |FILE_1 |
|212 |FILE_1 |
|FILE_2 |FILE_2 |
|1213 |FILE_2 |
|13312 |FILE_2 |
|4124 |FILE_2 |
+-------+-------+
Upvotes: 0
Views: 47
Reputation: 26
Here is one approach: flag the FILE_* rows, then use last with ignoreNulls over a running window so the most recent file name is carried forward onto the rows below it.
import org.apache.spark.sql.functions.{last, when}
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // assumes an active SparkSession named `spark`, as in spark-shell

val df = Seq("FILE_1", "11", "12", "212", "FILE_2", "1213", "13312", "4124").toDF("column1")

// running frame from the first row up to the current row (no partitioning, so all rows go to one partition)
val w = Window.rowsBetween(Window.unboundedPreceding, 0)
// keep column1 only on the FILE_* rows, then carry the latest non-null value forward
val df2 = df.withColumn("tmp", when($"column1".startsWith("FILE_"), $"column1").otherwise(null))
  .withColumn("column2", last($"tmp", ignoreNulls = true).over(w)).drop("tmp")
The result matches the expected output from the question.
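For reference, with the sample data above, df2.show(false) should print something along these lines (the frame has no explicit ordering, so this relies on the rows keeping their input order):

df2.show(false)

+-------+-------+
|column1|column2|
+-------+-------+
|FILE_1 |FILE_1 |
|11     |FILE_1 |
|12     |FILE_1 |
|212    |FILE_1 |
|FILE_2 |FILE_2 |
|1213   |FILE_2 |
|13312  |FILE_2 |
|4124   |FILE_2 |
+-------+-------+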
For more on Spark window functions, see this Databricks post.
Upvotes: 1