Benji Kok

Reputation: 332

Create array of literals and columns from List of Strings in Spark

I am trying to define functions in Scala that take a list of strings as input and convert it into the columns passed to the DataFrame array arguments used in the code below.

val df = sc.parallelize(Array((1,1),(2,2),(3,3))).toDF("foo","bar")
val df2 = df
        .withColumn("columnArray",array(df("foo").cast("String"),df("bar").cast("String")))
        .withColumn("litArray",array(lit("foo"),lit("bar")))

More specifically, I would like to create functions colFunction and litFunction (or just one function if possible) that take a list of strings as an input parameter and can be used as follows:

val df = sc.parallelize(Array((1,1),(2,2),(3,3))).toDF("foo","bar")
val colString = List("foo","bar")
val df2 = df
         .withColumn("columnArray",array(colFunction(colString))
         .withColumn("litArray",array(litFunction(colString)))

I have tried mapping colString to an Array of columns with all the transformations, but this doesn't work.
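
For reference, a minimal sketch of what such functions could look like (the names colFunction and litFunction come from the question; one possible design, assumed here, is to have each function build the array column itself, so no outer array(...) call is needed):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array, col, lit}

// Build an array column from existing columns, cast to string as in the first example
def colFunction(names: List[String]): Column =
  array(names.map(n => col(n).cast("string")): _*)

// Build an array column of string literals
def litFunction(names: List[String]): Column =
  array(names.map(lit): _*)

val df2 = df
  .withColumn("columnArray", colFunction(colString))
  .withColumn("litArray", litFunction(colString))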

Upvotes: 12

Views: 40577

Answers (2)

ZygD

Reputation: 24356

To create a df containing an array-type column (3 alternatives):

val df = Seq(
    Seq("foo", "bar"),
    Seq("baz", "qux")
).toDF("col_name")

val df = Seq(
    Array("foo", "bar"),
    Array("baz", "qux")
).toDF("col_name")

val df = Seq(
    List("foo", "bar"),
    List("baz", "qux")
).toDF("col_name")

To add a column of array type (a combined runnable sketch follows this list):

  • providing existing col names

    df.withColumn("new_col", array("col1", "col2"))
    
  • providing a list of existing col names

    df.withColumn("new_col", array(list_of_str map col: _*))
    
  • providing literal values (2 alternatives)

    df.withColumn("new_col", typedLit(Seq("foo", "bar")))
    df.withColumn("new_col", array(lit("foo"), lit("bar")))
    
  • providing a list of literal values (2 alternatives)

    df.withColumn("new_col", typedLit(list_of_str))
    df.withColumn("new_col", array(list_of_str map lit: _*))
    

Upvotes: 1

zero323

Reputation: 330063

Spark 2.2+:

Support for Seq, Map and Tuple (struct) literals has been added in SPARK-19254. According to tests:

import org.apache.spark.sql.functions.typedLit

typedLit(Seq("foo", "bar"))

Spark < 2.2

Just map with lit and wrap with array:

import org.apache.spark.sql.functions.{array, lit}

def asLitArray[T](xs: Seq[T]) = array(xs map lit: _*)

df.withColumn("an_array", asLitArray(colString)).show
// +---+---+----------+
// |foo|bar|  an_array|
// +---+---+----------+
// |  1|  1|[foo, bar]|
// |  2|  2|[foo, bar]|
// |  3|  3|[foo, bar]|
// +---+---+----------+

Regarding the transformation from Seq[String] to a Column of type Array, this functionality is already provided by:

def array(colName: String, colNames: String*): Column 

or

def array(cols: Column*): Column

Example:

val cols = Seq("bar", "foo")

cols match { case x :: xs => df.select(array(x, xs: _*)) }
// or
df.select(array(cols map col: _*))

Of course, all columns have to be of the same type.
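
If they are not, one option is to cast each column to a common type first, mirroring the .cast("String") in the question (a sketch, assuming org.apache.spark.sql.functions._ is imported):

df.select(array(cols.map(c => col(c).cast("string")): _*))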

Upvotes: 31
