Reputation: 393
I have a scala dictionary with stateCodes maped to state names.
I have a dataframe with a column containing state codes.
I want to add a column to the dataframe with state names.
val stateCodes = Map("100" -> "ca",
"22" -> "ny",
"12" -> "tx,
"71" -> "va")
df =
+------------+--------+
| stateId| count|
+------------+--------+
| 100|17243160|
| 71|12398011|
| 12|12302708|
| 3| 9734593|
Expected Output
new_df =
+------------+--------+
|stateName | count|
+------------+--------+
| ca|17243160|
| va|12398011|
| tx|12302708|
| unknown| 9734593|
Upvotes: 0
Views: 846
Reputation: 5213
You can do it in a couple of ways. One, the easiest, is to transform the map to a dataframe and use join.
import spark.implicits._
val stateCodesDf = stateCodes.toSeq().toDF("stateId","stateName")
val newDf = df.join(stateCodesDF,Seq("stateId"),"left_outer").
select("stateName","count")
The second is to write a simple UDF
import org.apache.spark.sql.functions.udf
import spark.implicits._
val stateUdf = udf( (stateId:String) => stateCodes.get(stateId) )
val newDf = df.select(stateUdf($"stateId"), $"count")
Both solutions include in the output the entries that do not have a match in the table and will have NULL.
Upvotes: 1