Reputation: 627
I have to map a list of columns to another column in a Spark dataset: think something like this
val translationMap: Map[Column, Column] = Map(
  lit("foo") -> lit("bar"),
  lit("baz") -> lit("bab")
)
And I have a dataframe like this one:
val df = Seq("foo", "baz").toDF("mov")
So I intend to perform the translation like this:
df.select(
  col("mov"),
  translationMap(col("mov"))
)
but this piece of code throws the following error:
java.util.NoSuchElementException: key not found: mov
Is there a way to perform such a translation without chaining hundreds of when clauses? Keep in mind that translationMap could have lots of key-value pairs.
Upvotes: 14
Views: 22718
Reputation: 4151
Instead of Map[Column, Column] you should use a Column containing a map literal:
import org.apache.spark.sql.functions.typedLit
val translationMap: Column = typedLit(Map(
"foo" -> "bar",
"baz" -> "bab"
))
The rest of your code can stay as-is:
df.select(
col("mov"),
translationMap(col("mov"))
).show
+---+---------------------------------------+
|mov|keys: [foo,baz], values: [bar,bab][mov]|
+---+---------------------------------------+
|foo|                                    bar|
|baz|                                    bab|
+---+---------------------------------------+
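If mov can contain values that are not keys of the map, the lookup returns null for those rows. A minimal sketch (assuming the same df and translationMap as above; the output column name translated is only illustrative) that falls back to the original value with coalesce:
import org.apache.spark.sql.functions.{coalesce, col}

df.select(
  col("mov"),
  // fall back to the original value when the key is absent from the map
  coalesce(translationMap(col("mov")), col("mov")).as("translated")
).show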
Upvotes: 20
Reputation: 10082
You cannot refer to a Scala collection declared on the driver like this inside a distributed DataFrame. An alternative would be to use a UDF, which will not be performance-efficient if you have a large dataset, since UDFs are not optimized by Spark.
import org.apache.spark.sql.functions.{col, udf}

val translationMap = Map("foo" -> "bar", "baz" -> "bab")
// returns null when the key is not present in the map
val getTranslationValue = udf((x: String) => translationMap.getOrElse(x, null.asInstanceOf[String]))
df.select(col("mov"), getTranslationValue($"mov").as("value")).show
//+---+-----+
//|mov|value|
//+---+-----+
//|foo| bar|
//|baz| bab|
//+---+-----+
Another solution would be to load the Map as a Dataset[(String, String)] and then join the two datasets, taking mov as the key, as sketched below.
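A minimal sketch of that join, assuming the same df as in the question; the column names key and value and the broadcast hint are illustrative, not part of the original answer:
import org.apache.spark.sql.functions.{broadcast, col}

// materialize the map as a two-column DataFrame
val translationDF = Seq("foo" -> "bar", "baz" -> "bab").toDF("key", "value")

// broadcast the small lookup table and left-join on mov
df.join(broadcast(translationDF), df("mov") === translationDF("key"), "left")
  .select(col("mov"), col("value"))
  .show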
Upvotes: 2