Reputation: 443
I have a directory with 10 days of separate data folders. Each date folder has JSON files like the following:
[{"value": 5,"count" : 16,"currency":"EUR","date" : "2021-01-10"},{"value": 7,"count" : 166,"currency":"USD","date" : "2021-01-10"},{"value": 2,"count" : 188,"currency":"USD","date" : "2021-01-10"},{"value": 3,"count" : 114,"currency":"GBP","date" : "2021-01-11"},{"value": 5,"count" : 80,"currency":"USD","date" : "2021-01-11"},{"value": 10,"count" : 41,"currency":"GBP","date" : "2021-01-12"},{"value": 7,"count" : 84,"currency":"USD","date" : "2021-01-12"},{"value": 3,"count" : 147,"currency":"EUR","date" : "2021-01-15"},{"value": 2,"count" : 172,"currency":"USD","date" : "2021-01-15"},{"value": 10,"count" : 118,"currency":"USD","date" : "2021-01-15"}]
I have read the data using
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // sc is the SparkContext from the Spark shell
val df = sqlContext.read.json("/Users/khan/directory/*/*.json")
I want to read the last 3 available days, group the data by date and currency, and find the mean value where the currency is USD.
My Idea:
dates_currency = df.select('date', 'currency').distinct().groupBy(desc('date', 'currency')).limit(3)
dates_currency.select('date', 'currency').distinct().where('currency'=='USD').mean()
I am having a problem with the syntax.
Upvotes: 0
Views: 62
Reputation: 42392
You can use dense_rank to get the 3 most recent days:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank dates within each currency from most recent to oldest,
// keep the 3 most recent USD days, and average the value per date.
val usd_mean = df.withColumn("rank", dense_rank().over(Window.partitionBy("currency").orderBy(desc("date"))))
  .filter("rank <= 3 and currency = 'USD'")
  .groupBy("date")
  .agg(mean("value"))
usd_mean.show()
+----------+----------+
| date|avg(value)|
+----------+----------+
|2021-01-15| 6.0|
|2021-01-12| 7.0|
|2021-01-11| 5.0|
+----------+----------+
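Note that partitioning by currency ranks only the dates that actually have USD rows. If you instead want the 3 most recent dates in the data overall (the question's "last 3 available days"), a minimal variant is to pick the distinct dates first and join back; the names last3Dates and usdMeanAlt below are just illustrative:

import org.apache.spark.sql.functions._

// Take the 3 most recent distinct dates across all currencies,
// then average the USD values per date.
val last3Dates = df.select("date").distinct()
  .orderBy(desc("date"))
  .limit(3)

val usdMeanAlt = df.join(last3Dates, Seq("date"))
  .filter(col("currency") === "USD")
  .groupBy("date")
  .agg(mean("value"))

usdMeanAlt.show()

With your sample data both approaches give the same result, because USD appears on every date; they would differ if some of the most recent dates had no USD rows.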
Upvotes: 1