Reputation: 1144
I am new to both Scala and Spark. I am trying to convert input data that is read from files as Double into Float (which is safe in this application) in order to reduce memory usage. I have been able to do that for a column of Double.
Current approach for a single element:
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._  // needed for toDF; already in scope in spark-shell, assuming a SparkSession named spark

// UDF that casts a single Double value to Float
val tcast = udf((s: Double) => s.toFloat)

val myDF = Seq(
  (1.0, Array(0.1, 2.1, 1.2)),
  (8.0, Array(1.1, 2.1, 3.2)),
  (9.0, Array(1.1, 1.1, 2.2))
).toDF("time", "crds")

myDF.withColumn("timeF", tcast(col("time"))).drop("time").withColumnRenamed("timeF", "time").show
myDF.withColumn("timeF", tcast(col("time"))).drop("time").withColumnRenamed("timeF", "time").schema
But I am currently stuck on transforming the array of doubles into an array of floats. Any help would be appreciated.
Upvotes: 0
Views: 421
Reputation: 770
You can use selectExpr, like:
val myDF = Seq(
  (1.0, Array(0.1, 2.1, 1.2)),
  (8.0, Array(1.1, 2.1, 3.2)),
  (9.0, Array(1.1, 1.1, 2.2))
).toDF("time", "crds")
myDF.printSchema()
// output:
root
|-- time: double (nullable = false)
|-- crds: array (nullable = true)
| |-- element: double (containsNull = false)
val df = myDF.selectExpr("cast(time as float) time", "cast(crds as array<float>) as crds")
df.show()
+----+---------------+
|time| crds|
+----+---------------+
| 1.0|[0.1, 2.1, 1.2]|
| 8.0|[1.1, 2.1, 3.2]|
| 9.0|[1.1, 1.1, 2.2]|
+----+---------------+
df.printSchema()
root
|-- time: float (nullable = false)
|-- crds: array (nullable = true)
| |-- element: float (containsNull = true)
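Equivalently, the same casts can be written with the Column API instead of a SQL expression string. A minimal sketch, assuming the same myDF as above and a hypothetical result name df2:
import org.apache.spark.sql.functions.col

// Column.cast accepts a DDL-style type string, so the array cast works the same way
val df2 = myDF.select(
  col("time").cast("float").as("time"),
  col("crds").cast("array<float>").as("crds")
)
// df2.printSchema() yields the same schema as the selectExpr version above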
Upvotes: 1