map dataframe column values to a to a scala dictionary

Question

I have a scala dictionary with stateCodes maped to state names.
I have a dataframe with a column containing state codes. I want to add a column to the dataframe with state names.

val stateCodes = Map("100" -> "ca",
                 "22" -> "ny",
                 "12" -> "tx,
                 "71" -> "va")

df = 
+------------+--------+
|     stateId|   count|
+------------+--------+
|         100|17243160|
|          71|12398011|
|          12|12302708|
|           3| 9734593|

Expected Output

new_df = 
+------------+--------+
|stateName   |   count|
+------------+--------+
|          ca|17243160|
|          va|12398011|
|          tx|12302708|
|     unknown| 9734593|

Roberto Congiu · Accepted Answer

You can do it in a couple of ways. One, the easiest, is to transform the map to a dataframe and use join.

import spark.implicits._
val stateCodesDf = stateCodes.toSeq().toDF("stateId","stateName")
val newDf = df.join(stateCodesDF,Seq("stateId"),"left_outer").
    select("stateName","count")

The second is to write a simple UDF

import org.apache.spark.sql.functions.udf
import spark.implicits._
val stateUdf = udf( (stateId:String) => stateCodes.get(stateId) )
val newDf = df.select(stateUdf($"stateId"), $"count")

Both solutions include in the output the entries that do not have a match in the table and will have NULL.

map dataframe column values to a to a scala dictionary

Answers (1)

Related Questions