Khaned

Reputation: 443

Use spark-shell to group data and find the mean for distinct dates from JSON files

I have a directory with 10 days of data in separate folders. Each date folder has JSON files like the following.

[{"value": 5,"count" : 16,"currency":"EUR","date" : "2021-01-10"},{"value": 7,"count" : 166,"currency":"USD","date" : "2021-01-10"},{"value": 2,"count" : 188,"currency":"USD","date" : "2021-01-10"},{"value": 3,"count" : 114,"currency":"GBP","date" : "2021-01-11"},{"value": 5,"count" : 80,"currency":"USD","date" : "2021-01-11"},{"value": 10,"count" : 41,"currency":"GBP","date" : "2021-01-12"},{"value": 7,"count" : 84,"currency":"USD","date" : "2021-01-12"},{"value": 3,"count" : 147,"currency":"EUR","date" : "2021-01-15"},{"value": 2,"count" : 172,"currency":"USD","date" : "2021-01-15"},{"value": 10,"count" : 118,"currency":"USD","date" : "2021-01-15"}]

I have read the data using

// sc is the SparkContext that spark-shell provides
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("/Users/khan/directory/*/*.json")
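
On Spark 2.x and later, the spark-shell also exposes a SparkSession named spark, so the same directory could presumably be read without building a SQLContext, roughly:

// Minimal sketch, assuming a Spark 2.x+ spark-shell where `spark` is predefined
val df = spark.read.json("/Users/khan/directory/*/*.json")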

I want to take the last 3 available days, group the data by date and currency, and find the mean value where the currency is USD.

My Idea:

dates_currency = df.select('date', 'currency').distinct().groupBy(desc('date', 'currency')).limit(3)
dates_currency.select('date', 'currency').distinct().where('currency'=='USD').mean()

I am having problems with the syntax.

Upvotes: 0

Views: 62

Answers (1)

mck

Reputation: 42392

You can use dense_rank to get the 3 most recent days:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, desc, mean}

val usd_mean = df.withColumn("rank", dense_rank().over(Window.partitionBy("currency").orderBy(desc("date"))))
                 .filter("rank <= 3 and currency = 'USD'")
                 .groupBy("date")
                 .agg(mean("value"))

usd_mean.show()
+----------+----------+
|      date|avg(value)|
+----------+----------+
|2021-01-15|       6.0|
|2021-01-12|       7.0|
|2021-01-11|       5.0|
+----------+----------+
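
For reference, an equivalent result can be obtained without a window function by first collecting the 3 most recent distinct dates and filtering on them (a minimal sketch against the same df; note it takes the 3 most recent dates overall rather than per currency, which happens to coincide here):

import org.apache.spark.sql.functions.{col, mean}

// 3 most recent distinct dates across the whole dataset (illustrative alternative)
val last3Dates = df.select("date").distinct()
                   .orderBy(col("date").desc)
                   .limit(3)
                   .collect()
                   .map(_.getString(0))

// Mean value per date, restricted to USD rows on those dates
val usdMeanAlt = df.filter(col("currency") === "USD" && col("date").isin(last3Dates: _*))
                   .groupBy("date")
                   .agg(mean("value"))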

Upvotes: 1
