Dcook
Dcook

Reputation: 961

How to avoid duplicate data scanning while running athena query

I have two query to calculate some attributes from table- 'agg_table'.The second one is basically to find out median value grouped by msgdate.My expected output should have these 5 fields:

msgdate,avg-Total,avg-duration,stddev and median. Currently I am doing by using UNION which works fine. I will execute this query in AWS Athena. To calculate median since the second query is accessing agg_data again, the data scan is being doubled, lets say input data size is 4 mb and in the Athena history page I can see data scanned is 8 mb.

I want to avoid the data scan in second time to save cost. Can you please help me to acheive this by calling agg_data table one time only?

Query 1: To calculate avg-Total,avg-duration,stddev

SELECT b.msgdate1 as msgdate,ROUND(b.avrg,3) AS avg-Total,
ROUND(AVG(b.duration),3) AS avg-duration,ROUND(b.stdv,3) AS stddev
FROM
(
SELECT AVG(a2.duration) OVER(PARTITION BY a2.msgdate) AS avrg, a2.duration as duration,a2.msgdate msgdate1,
  CASE
      WHEN stddev(a2.duration) OVER(PARTITION BY a2.msgdate) IS NULL THEN 0
      ELSE stddev(a2.duration) OVER(PARTITION BY a2.msgdate)
  END AS stdv
  FROM (
         agg_data
       ) a2 
) AS b
  

Query 2: To calculate median

WITH RankedTable AS 
    (
        SELECT msgdate, duration, 
            ROW_NUMBER() OVER (PARTITION BY msgdate ORDER BY duration) AS Rnk,
            COUNT(*) OVER (PARTITION BY msgdate) AS Cnt
        FROM agg_data
    )
    SELECT msgdate,duration as median
    FROM RankedTable
    WHERE Rnk = Cnt / 2 + 1 or Cnt=1

Upvotes: 1

Views: 529

Answers (1)

Theo
Theo

Reputation: 132922

I'm sure there's some trick that could do what you ask, but it will not be the easiest thing to do with all the window functions – combining these is always complicated.

If you can live with an approximation you could use the approx_percentile function – approx_percentile(column, 0.5) will be an approximation of the median. This can be used in your first query, avoiding the need for the second.

Upvotes: 1

Related Questions