alexanoid

Reputation: 25770

Scala Apache Spark and dynamic column list inside of DataFrame select method

I have the following Scala Spark code to parse a fixed-width txt file:

val schemaDf = df.select(
  df("value").substr(0, 6).cast("integer").alias("id"),
  df("value").substr(7, 6).alias("date"),
  df("value").substr(13, 29).alias("string")
)

I'd like to extract the following code:

  df("value").substr(0, 6).cast("integer").alias("id"),
  df("value").substr(7, 6).alias("date"),
  df("value").substr(13, 29).alias("string")

into a dynamic loop, so that the column parsing can be defined in some external configuration, something like this (where x will hold the config for each column parsing, but for now it's just simple numbers for demo purposes):

val x = List(1, 2, 3)
val df1 = df.select(
    x.foreach { 
        df("value").substr(0, 6).cast("integer").alias("id") 
    }
)

but right now the line df("value").substr(0, 6).cast("integer").alias("id") doesn't compile, failing with the following error:

type mismatch; found : org.apache.spark.sql.Column required: Int ⇒ ?

What am I doing wrong, and how do I properly return a dynamic Column list inside the df.select method?

Upvotes: 1

Views: 879

Answers (1)

Dan W

Reputation: 5782

select won't take a statement as input: foreach returns Unit, not a Column, which is why the compiler complains. Instead, map each element to a Column, collect them into a list, and expand that list into select as varargs with `: _*`:

import org.apache.spark.sql.Column

val x = List(1, 2, 3)
// Placeholder: every element currently builds the same demo column;
// in practice each config entry would produce a different Column.
val cols: List[Column] = x.map { i =>
  df("value").substr(0, 6).cast("integer").alias("id")
}
val df1 = df.select(cols: _*)
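To actually drive the parsing from configuration, the same pattern can be extended with a small spec type. This is a sketch under assumptions: FieldSpec, specsToColumns, and the field names are hypothetical, not from the original post; note that Spark's Column.substr is 1-based.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Hypothetical config record: one entry per fixed-width field.
case class FieldSpec(name: String, start: Int, length: Int, castTo: Option[String] = None)

// Build one Column per spec (substr positions are 1-based in Spark).
def specsToColumns(specs: Seq[FieldSpec]): Seq[Column] =
  specs.map { s =>
    val c = col("value").substr(s.start, s.length)
    s.castTo.fold(c)(t => c.cast(t)).alias(s.name)
  }

val specs = Seq(
  FieldSpec("id", 1, 6, Some("integer")),
  FieldSpec("date", 7, 6),
  FieldSpec("string", 13, 29)
)

// df is the input DataFrame with a single "value" column:
// val schemaDf = df.select(specsToColumns(specs): _*)
```

The specs list could then be loaded from any external source (JSON, a properties file, etc.) instead of being hard-coded.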

Upvotes: 2
