MLstudent

Reputation: 89

Scala Spark, compare two DataFrames and select the value of another column

I have two dataframes. What I want to do exactly is:

If the column Name is "P", I have to take the value of the FinalValue column of DF2 where DF1's id_1 column matches DF2's Id_name column; otherwise I have to fill it with nulls.

For example, I have the following DataFrames (DF1 and DF2):

df1
+--------+-------+-------+
|Name    | value | id_1  |
+--------+-------+-------+
|P       |5      | being |
|X       |1      | dose  |
|Z       |1      | yex   |

df2
+--------+------------+
|Id_name | FinalValue |
+--------+------------+
|ash     | h32        |
|being   | c11        |
|dose    | g21        |

In this case the output should be:

+--------+-------+-------------+
|Name    | value | FinalValue  |
+--------+-------+-------------+
|P       |5      | c11         |
|X       |1      | null        |
|Z       |1      | null        |

What I am trying is the following:

var df3 = df1.withColumn("FinalValue", when($"Name" === "P", df2.select(...)))

But as you can see, I don't know how to continue, because once I select a column of DF2 I can no longer reference the columns of DF1. How can I do this?

Maybe my explanation is not clear enough; if you need more information or explanation, just tell me. Thanks in advance.

Upvotes: 0

Views: 756

Answers (1)

mck

Reputation: 42352

You can do a left join, then mask the final value using when:

import org.apache.spark.sql.functions.{col, when}

val df3 = df1.join(
    df2,
    df1("id_1") === df2("Id_name"),
    "left"
).select(
    // keep every df1 column except the last one (id_1) ...
    df1.columns.dropRight(1).map(col) :+
    // ... and append FinalValue, nulled out unless Name is "P"
    when($"Name" === "P", $"FinalValue").as("FinalValue")
    : _*
)

df3.show
+----+-----+----------+
|Name|value|FinalValue|
+----+-----+----------+
|   P|    5|       c11|
|   X|    1|      null|
|   Z|    1|      null|
+----+-----+----------+
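
For readers who want to reproduce this end to end, here is a minimal, self-contained sketch that builds the two example DataFrames from the question with toDF and applies the same left-join-and-mask approach; the local SparkSession setup and the JoinMaskExample object name are just assumptions for running it standalone:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

object JoinMaskExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JoinMaskExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data matching DF1 and DF2 from the question
    val df1 = Seq(("P", 5, "being"), ("X", 1, "dose"), ("Z", 1, "yex"))
      .toDF("Name", "value", "id_1")
    val df2 = Seq(("ash", "h32"), ("being", "c11"), ("dose", "g21"))
      .toDF("Id_name", "FinalValue")

    // Left join on the id columns, then keep FinalValue only where Name == "P"
    val df3 = df1.join(df2, df1("id_1") === df2("Id_name"), "left")
      .select(
        df1.columns.dropRight(1).map(col) :+
          when($"Name" === "P", $"FinalValue").as("FinalValue"): _*
      )

    df3.show()  // prints the three-row result shown above

    spark.stop()
  }
}

A variation worth considering is pushing the Name condition into the join itself, e.g. df1.join(df2, df1("id_1") === df2("Id_name") && df1("Name") === "P", "left"); that skips matching rows whose FinalValue would be masked to null anyway.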

Upvotes: 1
