Rajarshi Bhadra

Reputation: 1944

Column manipulations in Spark Scala

I am learning to work with Apache Spark (Scala) and am still figuring out how things work here.

I am trying to achieve a simple task:

  1. Find the max of a column
  2. Subtract each value of the column from this max and create a new column

The code I am using is:

import org.apache.spark.sql.functions._
val training = sqlContext.createDataFrame(Seq(
  (10),
  (13),
  (14),
  (21)
)).toDF("Values")

val training_max = training.withColumn("Val_Max",training.groupBy().agg(max("Values"))
val training_max_sub = training_max.withColumn("Subs",training_max.groupBy().agg(col("Val_Max")-col("Values) ))

However, I am getting a lot of errors. I am more or less fluent in R; had I been doing the same task there, my code would have been:

library(dplyr)
new_data <- training %>%
    mutate(Subs= max(Values) - Values)

Upvotes: 2

Views: 479

Answers (1)

cheseaux

Reputation: 5315

Here is a solution using window functions. You'll need a HiveContext to use them:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

val training = sc.parallelize(Seq(10,13,14,21)).toDF("values")

training.withColumn("subs", 
   max($"values").over(Window.partitionBy()) - $"values").show

This produces the expected output:

+------+----+
|values|subs|
+------+----+
|    10|  11|
|    13|   8|
|    14|   7|
|    21|   0|
+------+----+
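
If you don't have a HiveContext available (window functions required one before Spark 2.0), a minimal alternative sketch is to compute the max once, pull it back to the driver, and subtract it as a literal. This assumes the same training DataFrame as above:

// Aggregate the global max and collect it to the driver as an Int
val maxVal = training.agg(max($"values")).first().getInt(0)

// Subtract each value from the max via a literal column
training.withColumn("subs", lit(maxVal) - $"values").show

This trades the single window pass for two jobs (one for the aggregate, one for the projection), but it runs on a plain SQLContext.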

Upvotes: 3
