Reputation: 332
I am trying to define functions in Scala that take a list of strings as input and convert them into the columns passed to the DataFrame array arguments used in the code below.
val df = sc.parallelize(Array((1,1),(2,2),(3,3))).toDF("foo","bar")
val df2 = df
.withColumn("columnArray",array(df("foo").cast("String"),df("bar").cast("String")))
.withColumn("litArray",array(lit("foo"),lit("bar")))
More specifically, I would like to create functions colFunction
and litFunction
(or just one function if possible) that takes a list of strings as an input parameter and can be used as follows:
val df = sc.parallelize(Array((1,1),(2,2),(3,3))).toDF("foo","bar")
val colString = List("foo","bar")
val df2 = df
.withColumn("columnArray",array(colFunction(colString)))
.withColumn("litArray",array(litFunction(colString)))
I have tried mapping colString to an Array of columns with all the transformations, but this doesn't work.
Upvotes: 12
Views: 40577
Reputation: 24356
To create a df containing an array-type column (3 alternatives):
val df = Seq(
  Seq("foo", "bar"),
  Seq("baz", "qux")
).toDF("col_name")
val df = Seq(
  Array("foo", "bar"),
  Array("baz", "qux")
).toDF("col_name")
val df = Seq(
  List("foo", "bar"),
  List("baz", "qux")
).toDF("col_name")
To add a col of array type:
providing existing col names
df.withColumn("new_col", array("col1", "col2"))
providing a list of existing col names
df.withColumn("new_col", array(list_of_str map col: _*))
providing literal values (2 alternatives)
df.withColumn("new_col", typedLit(Seq("foo", "bar")))
df.withColumn("new_col", array(lit("foo"), lit("bar")))
providing a list of literal values (2 alternatives)
df.withColumn("new_col", typedLit(list_of_str))
df.withColumn("new_col", array(list_of_str map lit: _*))
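The list-based variants above can be packaged as the colFunction / litFunction helpers the question asks for; a minimal sketch, assuming a SparkSession with its implicits in scope (the helper names come from the question, not from any Spark API):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array, col, lit}

// Array column built from existing column names.
def colFunction(names: Seq[String]): Column = array(names.map(col): _*)

// Array column built from string literal values.
def litFunction(values: Seq[String]): Column = array(values.map(lit): _*)

val colString = List("foo", "bar")
val df2 = df
  .withColumn("columnArray", colFunction(colString))
  .withColumn("litArray", litFunction(colString))
```

Note that each helper already returns the array column, so there is no need to wrap the call in array(...) again, unlike the usage sketched in the question.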
Upvotes: 1
Reputation: 330063
Spark 2.2+:
Support for Seq, Map and Tuple (struct) literals has been added in SPARK-19254. According to tests:
import org.apache.spark.sql.functions.typedLit
typedLit(Seq("foo", "bar"))
Spark < 2.2:
Just map with lit and wrap with array:
def asLitArray[T](xs: Seq[T]) = array(xs map lit: _*)
df.withColumn("an_array", asLitArray(colString)).show
// +---+---+----------+
// |foo|bar| an_array|
// +---+---+----------+
// | 1| 1|[foo, bar]|
// | 2| 2|[foo, bar]|
// | 3| 3|[foo, bar]|
// +---+---+----------+
Regarding the transformation from Seq[String] to a Column of array type, this functionality is already provided by:
def array(colName: String, colNames: String*): Column
or
def array(cols: Column*): Column
Example:
val cols = Seq("bar", "foo")
cols match { case x :: xs => df.select(array(x, xs: _*)) }
// or
df.select(array(cols map col: _*))
Of course, all columns have to be of the same type.
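Since all array elements must share a type, mixed-type columns can be cast before wrapping, as the question's original snippet does with .cast("String"); a sketch of one way to fold the cast into a helper (the function name is illustrative, not a Spark API):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array, col}

// Cast every named column to string so the array elements share a type.
def asStringArray(names: Seq[String]): Column =
  array(names.map(n => col(n).cast("string")): _*)

// Usage:
// df.withColumn("columnArray", asStringArray(Seq("foo", "bar")))
```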
Upvotes: 31