Reputation: 41
I have a dataframe whose records are identified by a key, but the same key can appear in multiple records. My goal is to merge all the records that share a key, as follows.
Let's suppose my input dataframe looks something like this:
key | value1 | value2 | value3
-------------------------------
a   | 1      | null   | null
a   | null   | 2      | null
a   | null   | null   | 3
and I want the output after merging on 'a' to look like this:
key | value1 | value2 | value3
-------------------------------
a   | 1      | 2      | 3
One thing I am sure of: for the key 'a', each record will have exactly one of the three values present.
Thanks
Upvotes: 0
Views: 786
Reputation: 35249
If you know there is at most one non-null value per column in each group (or you don't care which one you get), you can use first:
import org.apache.spark.sql.functions.{first, last}
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark

val df = Seq(
  ("a", Some(1), None, None),
  ("a", None, Some(2), None),
  ("a", None, None, Some(3))
).toDF("key", "value1", "value2", "value3")
df.groupBy("key").agg(
  first("value1", ignoreNulls = true) as "value1",
  first("value2", ignoreNulls = true) as "value2",
  first("value3", ignoreNulls = true) as "value3"
).show
// +---+------+------+------+
// |key|value1|value2|value3|
// +---+------+------+------+
// |  a|     1|     2|     3|
// +---+------+------+------+
or last:
df.groupBy("key").agg(
  last("value1", ignoreNulls = true) as "value1",
  last("value2", ignoreNulls = true) as "value2",
  last("value3", ignoreNulls = true) as "value3"
).show
// +---+------+------+------+
// |key|value1|value2|value3|
// +---+------+------+------+
// |  a|     1|     2|     3|
// +---+------+------+------+
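If there are many value columns, you can apply the same idea without listing each column by hand, by building the aggregation expressions programmatically. A minimal sketch, assuming the df defined above and that every column other than key should be collapsed this way:
val valueCols = df.columns.filterNot(_ == "key")
// One first(..., ignoreNulls = true) expression per value column,
// each aliased back to its original name.
val aggExprs = valueCols.map(c => first(c, ignoreNulls = true).as(c))

df.groupBy("key").agg(aggExprs.head, aggExprs.tail: _*).show
// +---+------+------+------+
// |key|value1|value2|value3|
// +---+------+------+------+
// |  a|     1|     2|     3|
// +---+------+------+------+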
Upvotes: 1