Reputation: 961
I have two query to calculate some attributes from table- 'agg_table'.The second one is basically to find out median value grouped by msgdate.My expected output should have these 5 fields:
msgdate,avg-Total,avg-duration,stddev and median. Currently I am doing by using UNION which works fine. I will execute this query in AWS Athena. To calculate median since the second query is accessing agg_data again, the data scan is being doubled, lets say input data size is 4 mb and in the Athena history page I can see data scanned is 8 mb.
I want to avoid the data scan in second time to save cost. Can you please help me to acheive this by calling agg_data table one time only?
Query 1: To calculate avg-Total,avg-duration,stddev
SELECT b.msgdate1 as msgdate,ROUND(b.avrg,3) AS avg-Total,
ROUND(AVG(b.duration),3) AS avg-duration,ROUND(b.stdv,3) AS stddev
FROM
(
SELECT AVG(a2.duration) OVER(PARTITION BY a2.msgdate) AS avrg, a2.duration as duration,a2.msgdate msgdate1,
CASE
WHEN stddev(a2.duration) OVER(PARTITION BY a2.msgdate) IS NULL THEN 0
ELSE stddev(a2.duration) OVER(PARTITION BY a2.msgdate)
END AS stdv
FROM (
agg_data
) a2
) AS b
Query 2: To calculate median
WITH RankedTable AS
(
SELECT msgdate, duration,
ROW_NUMBER() OVER (PARTITION BY msgdate ORDER BY duration) AS Rnk,
COUNT(*) OVER (PARTITION BY msgdate) AS Cnt
FROM agg_data
)
SELECT msgdate,duration as median
FROM RankedTable
WHERE Rnk = Cnt / 2 + 1 or Cnt=1
Upvotes: 1
Views: 529
Reputation: 132922
I'm sure there's some trick that could do what you ask, but it will not be the easiest thing to do with all the window functions – combining these is always complicated.
If you can live with an approximation you could use the approx_percentile
function – approx_percentile(column, 0.5)
will be an approximation of the median. This can be used in your first query, avoiding the need for the second.
Upvotes: 1