Reputation: 41
I have a dataframe whose records are identified by a key, but the same key can appear in multiple records. My goal is to merge all the records that share a key, as follows.
Let's suppose my input dataframe looks something like this:
key | value1 | value2 | value3
-------------------------------
a   | 1      | null   | null
a   | null   | 2      | null
a   | null   | null   | 3
and I want the output after merging on 'a' to look like this:
key | value1 | value2 | value3
-------------------------------
a   | 1      | 2      | 3
One thing I am sure of: for the key 'a', each record will have exactly one of the three values present.
Thanks
Upvotes: 0
Views: 786
Reputation: 35249
If you know there is at most one non-null value per column in each group (or you don't care which one you get), you can use first:
import org.apache.spark.sql.functions.{first, last}
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark

val df = Seq(
  ("a", Some(1), None, None),
  ("a", None, Some(2), None),
  ("a", None, None, Some(3))
).toDF("key", "value1", "value2", "value3")
df.groupBy("key").agg(
  first("value1", ignoreNulls = true) as "value1",
  first("value2", ignoreNulls = true) as "value2",
  first("value3", ignoreNulls = true) as "value3"
).show
// +---+------+------+------+
// |key|value1|value2|value3|
// +---+------+------+------+
// |  a|     1|     2|     3|
// +---+------+------+------+
or last:
df.groupBy("key").agg(
  last("value1", ignoreNulls = true) as "value1",
  last("value2", ignoreNulls = true) as "value2",
  last("value3", ignoreNulls = true) as "value3"
).show
// +---+------+------+------+
// |key|value1|value2|value3|
// +---+------+------+------+
// |  a|     1|     2|     3|
// +---+------+------+------+
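If there are many value columns, you can apply the same idea without listing each column by hand, by building the aggregation expressions programmatically. A minimal sketch, assuming the df defined above and that every column other than key should be collapsed this way:
val valueCols = df.columns.filterNot(_ == "key")
// One first(..., ignoreNulls = true) expression per value column,
// each aliased back to its original name.
val aggExprs = valueCols.map(c => first(c, ignoreNulls = true).as(c))

df.groupBy("key").agg(aggExprs.head, aggExprs.tail: _*).show
// +---+------+------+------+
// |key|value1|value2|value3|
// +---+------+------+------+
// |  a|     1|     2|     3|
// +---+------+------+------+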
Upvotes: 1