Reputation: 1944
I am learning to work with Apache Spark (Scala) and am still figuring out how things work here. I am trying to achieve the simple task of subtracting each value in a column from that column's maximum.
The code I am using is:
import org.apache.spark.sql.functions._
val training = sqlContext.createDataFrame(Seq(
  (10),
  (13),
  (14),
  (21)
)).toDF("Values")
val training_max = training.withColumn("Val_Max", training.groupBy().agg(max("Values")))
val training_max_sub = training_max.withColumn("Subs", training_max.groupBy().agg(col("Val_Max") - col("Values")))
However, I am getting a lot of errors. I am more or less fluent in R, and had I been doing the same task there, my code would have been:
library(dplyr)
new_data <- training %>%
  mutate(Subs = max(Values) - Values)
Upvotes: 2
Views: 479
Reputation: 5315
Here is a solution using window functions. You'll need a HiveContext to use them:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
// Build the training DataFrame from a local sequence
val training = sc.parallelize(Seq(10, 13, 14, 21)).toDF("values")

// Subtract each value from the maximum taken over a single global window
training.withColumn("subs",
  max($"values").over(Window.partitionBy()) - $"values").show
Which produces the expected output:
+------+----+
|values|subs|
+------+----+
| 10| 11|
| 13| 8|
| 14| 7|
| 21| 0|
+------+----+
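Note that a window with no partitioning columns moves every row into a single partition, which can become a bottleneck on large data. An alternative that also avoids the HiveContext is to compute the maximum once with an aggregation and then subtract it as a literal. A minimal sketch, assuming the same sqlContext and training DataFrame as above:

import org.apache.spark.sql.functions._

// Compute the column maximum once as a plain Scala Int
val maxVal = training.agg(max($"values")).first().getInt(0)

// Subtract each value from that constant; no window function needed
training.withColumn("subs", lit(maxVal) - $"values").show

This trades the window function for a separate action (first()), which triggers a job to compute the maximum before the final transformation runs.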
Upvotes: 3