Quiescent

Reputation: 1144

Creating a column of array using another column of array in a Spark Dataframe (Scala)

I am new to both Scala and Spark. I am trying to convert input that is read from files as Double into Float (which is safe in this application) in order to reduce memory usage. I have been able to do this for a plain Double column.

Current approach for a single element:

import org.apache.spark.sql.functions.{col, udf}

// UDF that casts a single Double value to Float
val tcast = udf((s: Double) => s.toFloat)

val myDF = Seq(
   (1.0, Array(0.1, 2.1, 1.2)),
   (8.0, Array(1.1, 2.1, 3.2)),
   (9.0, Array(1.1, 1.1, 2.2))
).toDF("time", "crds")

myDF.withColumn("timeF", tcast(col("time"))).drop("time").withColumnRenamed("timeF", "time").show
myDF.withColumn("timeF", tcast(col("time"))).drop("time").withColumnRenamed("timeF", "time").schema

But I am currently stuck on converting the array of doubles to floats. Any help would be appreciated.

Upvotes: 0

Views: 421

Answers (1)

Cesar A. Mostacero

Reputation: 770

You can use selectExpr, like:

val myDF = Seq(
   (1.0, Array(0.1, 2.1, 1.2)),
   (8.0, Array(1.1, 2.1, 3.2)),
   (9.0, Array(1.1, 1.1, 2.2))
).toDF("time", "crds")

myDF.printSchema()

// output:
root
 |-- time: double (nullable = false)
 |-- crds: array (nullable = true)
 |    |-- element: double (containsNull = false)

val df = myDF.selectExpr("cast(time as float) time", "cast(crds as array<float>) as crds")
df.show()

+----+---------------+
|time|           crds|
+----+---------------+
| 1.0|[0.1, 2.1, 1.2]|
| 8.0|[1.1, 2.1, 3.2]|
| 9.0|[1.1, 1.1, 2.2]|
+----+---------------+

df.printSchema()

root
 |-- time: float (nullable = false)
 |-- crds: array (nullable = true)
 |    |-- element: float (containsNull = true)
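
If you prefer the typed DataFrame API over a SQL string, the same cast can be expressed with Column.cast and an explicit ArrayType. A minimal sketch (the dfTyped name is just for illustration):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, FloatType}

// Same cast via the typed API; ArrayType(FloatType) defaults to containsNull = true,
// which matches the schema produced by the selectExpr version above.
val dfTyped = myDF
  .withColumn("time", col("time").cast(FloatType))
  .withColumn("crds", col("crds").cast(ArrayType(FloatType)))

dfTyped.printSchema()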

Upvotes: 1
