Fill null values in a row with frequency of other column

Question

In a Spark Structured Streaming context, I have this DataFrame:

+------+----------+---------+
|brand |Timestamp |frequency|
+------+----------+---------+
|BR1   |1632899456|4        |
|BR1   |1632901256|4        |
|BR300 |1632901796|null     |
|BR300 |1632899155|null     |
|BR90  |1632901743|1        |
|BR1   |1632899933|4        |
|BR1   |1632899756|4        |
|BR22  |1632900776|null     |
|BR22  |1632900176|null     |
+------+----------+---------+

I would like to replace the null values with the frequency of the brand in the batch (the number of rows with that brand), in order to obtain a DataFrame like this one:

+------+----------+---------+
|brand |Timestamp |frequency|
+------+----------+---------+
|BR1   |1632899456|4        |
|BR1   |1632901256|4        |
|BR300 |1632901796|2        |
|BR300 |1632899155|2        |
|BR90  |1632901743|1        |
|BR1   |1632899933|4        |
|BR1   |1632899756|4        |
|BR22  |1632900776|2        |
|BR22  |1632900176|2        |
+------+----------+---------+

I am using Spark 2.4.3 and SQLContext, with Scala.

pasha701 · Accepted Answer

With "count" as a window function:

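import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, when}
// Needed for toDF and the $"..." syntax; assumes a SparkSession named spark
// (with a SQLContext, import sqlContext.implicits._ instead)
import spark.implicits._

val df = Seq(
  ("BR1", 1632899456, Some(4)),
  ("BR1", 1632901256, Some(4)),
  ("BR300", 1632901796, None),
  ("BR300", 1632899155, None),
  ("BR90", 1632901743, Some(1)),
  ("BR1", 1632899933, Some(4)),
  ("BR1", 1632899756, Some(4)),
  ("BR22", 1632900776, None),
  ("BR22", 1632900176, None)
).toDF("brand", "Timestamp", "frequency")

// All rows with the same brand fall into one window partition
val brandWindow = Window.partitionBy("brand")
// Keep existing frequencies; fill nulls with the number of rows of that brand
val result = df.withColumn(
  "frequency",
  when($"frequency".isNotNull, $"frequency")
    .otherwise(count($"brand").over(brandWindow))
)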

Result:

+-----+----------+---------+
|brand|Timestamp |frequency|
+-----+----------+---------+
|BR1  |1632899456|4        |
|BR1  |1632901256|4        |
|BR1  |1632899933|4        |
|BR1  |1632899756|4        |
|BR22 |1632900776|2        |
|BR22 |1632900176|2        |
|BR300|1632901796|2        |
|BR300|1632899155|2        |
|BR90 |1632901743|1        |
+-----+----------+---------+
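
To make clear what the count over a brand partition computes, here is a minimal sketch of the same fill logic on plain Scala collections, without Spark. The names Event, FillFrequencyLocal and fillFrequency are hypothetical and only serve the illustration.

```scala
// Hypothetical record type mirroring the brand / Timestamp / frequency columns.
final case class Event(brand: String, timestamp: Long, frequency: Option[Int])

object FillFrequencyLocal {
  // Replace each missing frequency with the number of rows sharing the brand.
  def fillFrequency(events: Seq[Event]): Seq[Event] = {
    // Rows per brand: the collection analogue of count over a brand partition.
    val rowsPerBrand: Map[String, Int] =
      events.groupBy(_.brand).map { case (brand, rows) => brand -> rows.size }
    // Existing values are kept; only None is filled in.
    events.map(e => e.copy(frequency = e.frequency.orElse(rowsPerBrand.get(e.brand))))
  }

  def main(args: Array[String]): Unit = {
    val batch = Seq(
      Event("BR1", 1632899456L, Some(4)),
      Event("BR300", 1632901796L, None),
      Event("BR300", 1632899155L, None),
      Event("BR90", 1632901743L, Some(1))
    )
    fillFrequency(batch).foreach(println)
  }
}
```

In this sample both BR300 rows end up with Some(2), while the rows that already had a frequency are left unchanged, which is the behaviour of the when/otherwise expression above.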

Solution with groupBy and a join:

// Number of rows per brand
val countDF = df.select("brand").groupBy("brand").count()

// Join the counts back and use them only where frequency is null
df.alias("df")
  .join(countDF.alias("cnt"), Seq("brand"))
  .withColumn("frequency", when($"df.frequency".isNotNull, $"df.frequency").otherwise($"cnt.count"))
  .select("df.brand", "df.Timestamp", "frequency")
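
Since the question is about Structured Streaming and asks for the frequency "in the batch", one possible way to apply either solution is inside foreachBatch (available since Spark 2.4), where each micro-batch arrives as a regular DataFrame. This matters because, as far as I know, non-time-based window functions are rejected on a streaming DataFrame itself. The following is only a sketch under assumptions: the object and method names are hypothetical, `stream` is assumed to be a streaming DataFrame with the brand, Timestamp and frequency columns, and show() stands in for whatever sink is really used.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, when}
import org.apache.spark.sql.streaming.StreamingQuery

object FillFrequencyPerBatch {

  // Fills null "frequency" values with the number of rows that share the
  // same brand inside the given DataFrame (here: one micro-batch).
  def fillFrequency(batch: DataFrame): DataFrame = {
    val brandWindow = Window.partitionBy("brand")
    batch.withColumn(
      "frequency",
      when(col("frequency").isNotNull, col("frequency"))
        .otherwise(count(col("brand")).over(brandWindow))
    )
  }

  // Invoked once per micro-batch. A named method with an explicit Unit result
  // is passed below instead of a lambda to sidestep overload ambiguity
  // between the Scala and Java variants of foreachBatch on Scala 2.12.
  def processBatch(batch: DataFrame, batchId: Long): Unit = {
    // Placeholder sink (assumption): print the filled batch to the console.
    fillFrequency(batch).show(false)
  }

  // `stream` is assumed to be a streaming DataFrame with the columns
  // brand, Timestamp and frequency.
  def start(stream: DataFrame): StreamingQuery =
    stream.writeStream.foreachBatch(processBatch _).start()
}
```

The counts are computed per micro-batch only, so the same brand can get different fill values in different batches.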
