mrbolichi
mrbolichi

Reputation: 627

Use Map to replace column values in Spark

I have to map a list of columns to another column in a Spark dataset: think something like this

val translationMap: Map[Column, Column] = Map(
  lit("foo") -> lit("bar"),
  lit("baz") -> lit("bab")
)

And I have a dataframe like this one:

val df = Seq("foo", "baz").toDF("mov")

So I intend to perform the translation like this:

df.select(
  col("mov"),
  translationMap(col("mov"))
)

but this piece of code spits the following error

key not found: movs
java.util.NoSuchElementException: key not found: movs

Is there a way to perform such translation without concatenating hundreds of whens? think that translationMap could have lots of pairs key-value.

Upvotes: 14

Views: 22718

Answers (2)

user10938362
user10938362

Reputation: 4151

Instead of Map[Column, Column] you should use a Column containing a map literal:

import org.apache.spark.sql.functions.typedLit

val translationMap: Column = typedLit(Map(
  "foo" -> "bar",
  "baz" -> "bab"
))

The rest of your code can stay as-is:

df.select(
  col("mov"),
  translationMap(col("mov"))
).show
+---+---------------------------------------+
|mov|keys: [foo,baz], values: [bar,bab][mov]|
+---+---------------------------------------+
|foo|                                    bar|
|baz|                                    bab|
+---+---------------------------------------+

Upvotes: 20

philantrovert
philantrovert

Reputation: 10082

You can not refer a Scala collection declared on the driver like this inside a distributed dataframe. An alternative would be to use a UDF which will not be performance efficient if you have a large dataset since UDFs are not optimized by Spark.

val translationMap = Map( "foo" -> "bar" , "baz" -> "bab" )
val getTranslationValue = udf ((x: String)=>translationMap.getOrElse(x,null.asInstanceOf[String]) )
df.select(col("mov"), getTranslationValue($"mov").as("value")  ).show

//+---+-----+
//|mov|value|
//+---+-----+
//|foo|  bar|
//|baz|  bab|
//+---+-----+

Another solution would be to load the Map as a DataSet[(String, String)] and the join the two datasets taking mov as the key.

Upvotes: 2

Related Questions