user7394882

Reputation: 193

Select with aggregation in Spark and Scala

I wrote this in PySpark:

from pyspark.sql.functions import date_format

result = df.select('*', date_format('window_start', 'yyyy-MM-dd hh:mm').alias('time_window')) \
    .groupby('time_window') \
    .agg({'total_score': 'sum'})
result.show()

I want to make it work in Scala with Spark. I tried this, but I got an error I didn't understand, as I'm new to Scala:

val result = df.select('*', date_format(df("time_window"), "yyyy-MM-dd hh:mm").alias("time_window"))
  .groupBy("time_window")
  .agg(sum("total_score"))

The error said:

overloaded method value select with alternatives:
  [U1, U2](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1], c2: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U2])org.apache.spark.sql.Dataset[(U1, U2)]
  (col: String, cols: String*)org.apache.spark.sql.DataFrame
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (Char, org.apache.spark.sql.Column)
Process.scala /Process/src line 30 Scala Problem

How can I fix the source code so that it runs in Scala?

Upvotes: 0

Views: 918

Answers (1)

koiralo

Reputation: 23119

This works similarly to your PySpark code:

  import spark.implicits._
  import org.apache.spark.sql.functions.{date_format, sum}

  val data = spark.sparkContext.parallelize(Seq(
    ("2017-05-21", 1),
    ("2017-05-21", 1),
    ("2017-05-22", 1),
    ("2017-05-22", 1),
    ("2017-05-23", 1),
    ("2017-05-23", 1),
    ("2017-05-23", 1),
    ("2017-05-23", 1))).toDF("time_window", "foo")

  // Reformat the window column, then group by the formatted value and sum
  data.withColumn("$time_window", date_format(data("time_window"), "yyyy-MM-dd hh:mm"))
    .groupBy("$time_window")
    .agg(sum("foo")).show
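
For reference, the error in your question comes from '*': in Scala, single quotes create a Char literal rather than a column, so no select overload matches (Char, Column). If you want to keep the select-based shape of your PySpark code instead, a minimal sketch (assuming the source column is named window_start, as in your PySpark version) would be:

  import org.apache.spark.sql.functions.{col, date_format, sum}

  // col("*") selects all existing columns, matching PySpark's select('*', ...)
  val result = df.select(col("*"), date_format(df("window_start"), "yyyy-MM-dd hh:mm").alias("time_window"))
    .groupBy("time_window")
    .agg(sum("total_score"))
  result.show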

Upvotes: 0
