Reputation: 2431
My input DataFrame looks like this:
+-------+
|Column1|
+-------+
|FILE_1 |
|11 |
|12 |
|212 |
|FILE_2 |
|1213 |
|13312 |
|4124 |
+-------+
In the input DataFrame, FILE_* is a file name and the rows below it are that file's contents.
For example, FILE_1 has the contents 11, 12, and 212.
I want to add a new column to the existing input DataFrame that captures this relation between a file name and its contents, like below:
+-------+-------+
|Column1|Column2|
+-------+-------+
|FILE_1 |FILE_1 |
|11 |FILE_1 |
|12 |FILE_1 |
|212 |FILE_1 |
|FILE_2 |FILE_2 |
|1213 |FILE_2 |
|13312 |FILE_2 |
|4124 |FILE_2 |
+-------+-------+
Upvotes: 0
Views: 47
Reputation: 26
Here is one approach: flag the FILE_* rows, then use last with ignoreNulls over a running window so the most recent file name is carried forward onto the rows below it.
import org.apache.spark.sql.functions.{last, when}
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // assumes an active SparkSession named `spark`, as in spark-shell

val df = Seq("FILE_1", "11", "12", "212", "FILE_2", "1213", "13312", "4124").toDF("column1")

// running frame from the first row up to the current row (no partitioning, so all rows go to one partition)
val w = Window.rowsBetween(Window.unboundedPreceding, 0)
// keep column1 only on the FILE_* rows, then carry the latest non-null value forward
val df2 = df.withColumn("tmp", when($"column1".startsWith("FILE_"), $"column1").otherwise(null))
  .withColumn("column2", last($"tmp", ignoreNulls = true).over(w)).drop("tmp")
The result matches the expected output from the question.
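For reference, with the sample data above, df2.show(false) should print something along these lines (the frame has no explicit ordering, so this relies on the rows keeping their input order):

df2.show(false)

+-------+-------+
|column1|column2|
+-------+-------+
|FILE_1 |FILE_1 |
|11     |FILE_1 |
|12     |FILE_1 |
|212    |FILE_1 |
|FILE_2 |FILE_2 |
|1213   |FILE_2 |
|13312  |FILE_2 |
|4124   |FILE_2 |
+-------+-------+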
For more on Spark window functions, see this Databricks post.
Upvotes: 1