Vikrant Bhalerao

Reputation: 63

Scala - Convert column having comma separated numbers (currently string) to Array of Double in Dataframe

I have a DataFrame column that is currently in String format, holding multiple comma-separated double values (mostly 2 or 3). Refer to the schema snapshot below.

Sample : "619.619620621622, 123.12412512699"

root
 |-- MyCol: string (nullable = true)

I want to convert it to an array of double, so that it matches the schema below.

Desired : array<double>
[619.619620621622, 123.12412512699]

root
 |-- MyCol: array (nullable = true)
 |    |-- element: double (containsNull = true)

I know how to do this on a single string value. For example, something like this works:
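
val s = "619.619620621622, 123.12412512699"
val doubles: Array[Double] = s.split(",").map(_.trim.toDouble)
// doubles: Array(619.619620621622, 123.12412512699)

Now I want to do it on the complete DataFrame column.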

Is there any way this could be done with a one- or two-line solution?

Upvotes: 1

Views: 1069

Answers (2)

gatear

Reputation: 946

Assuming the starting point:

val spark: SparkSession = ???
import spark.implicits._

val df: DataFrame = ???

here is a solution based on UDF:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.UserDefinedFunction

def toDoubles: UserDefinedFunction =
  udf { string: String =>
    string
      .split(",")
      .map(_.trim) //based on your input you may need to trim the strings
      .map(_.toDouble)
  }

df
  .select(toDoubles($"MyCol") as "doubles")

Edit: the toDouble conversion already trims the string, so the .map(_.trim) step is optional.
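
As a quick check, applying the UDF to a DataFrame holding the sample value from the question (a sketch, assuming df is a one-column DataFrame named MyCol as in the starting point above) yields the desired array<double> schema:

df
  .select(toDoubles($"MyCol") as "doubles")
  .printSchema()

// root
//  |-- doubles: array (nullable = true)
//  |    |-- element: double (containsNull = false)

Note containsNull = false here, because the UDF returns an Array[Double] of primitive doubles.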

Upvotes: 1

blackbishop

Reputation: 32700

split + cast should do the job:

import org.apache.spark.sql.functions.{col, split}
import spark.implicits._ // provides toDF (already in scope in the spark-shell)

val df = Seq("619.619620621622, 123.12412512699").toDF("MyCol")

val df2 = df.withColumn("myCol", split(col("MyCol"), ",").cast("array<double>"))

df2.printSchema

//root
// |-- myCol: array (nullable = true)
// |    |-- element: double (containsNull = true)
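
And the data, confirming the cast handles the leading space in the second element:

df2.show(false)

//+-----------------------------------+
//|myCol                              |
//+-----------------------------------+
//|[619.619620621622, 123.12412512699]|
//+-----------------------------------+

One caveat: with the default (non-ANSI) settings, cast silently turns any element it cannot parse into null instead of failing.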

Upvotes: 2
