Reputation: 11
I have a query that needs a count of colA, grouped by colB and by the average of colC. For example:
SELECT COUNT( X.colA ), X.colB , X.MEASURE
FROM (
SELECT colA , colB , avg(colC) MEASURE
FROM tableA
GROUP BY colA, colB
HAVING round(avg(colC),2) > 0
) X
GROUP BY X.MEASURE , X.colB
HAVING X.MEASURE BETWEEN 0 AND 3000
ORDER BY MEASURE
An example result could be:
No. of users, URL, average time spent
90182, abc.com, 334
293556, def.com, 33
The problem with the above query is that, because it has a subquery, the inner query shuffles a huge amount of data to the outer query as an intermediate result, which makes the query very slow on large data sets.
Is there a way to convert the above query into one without any subquery, or is there a UDAF available, so that there is no major shuffle of intermediate data and it runs in a single stage?
Upvotes: 1
Views: 445
Reputation: 1269873
I don't see an easy way to simplify the query. However, moving the outer having clause into the subquery might help performance a bit, because groups outside the range are dropped before the second aggregation:
SELECT COUNT( X.colA ), X.colB , X.MEASURE
FROM (SELECT colA , colB , avg(colC) MEASURE
FROM tableA
GROUP BY colA, colB
HAVING round(avg(colC),2) > 0 and avg(colC) <= 3000
) X
GROUP BY X.MEASURE , X.colB
ORDER BY MEASURE;
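As a quick sanity check, here is a minimal sketch that runs both versions of the query against a small in-memory SQLite table and compares the results. The sample rows and the use of SQLite are my own assumptions for illustration. Your engine and data will differ, and this says nothing about performance, only that the filter can be moved without changing the output on this data.

```python
import sqlite3

# Hypothetical toy data: (colA = user, colB = URL, colC = time spent).
rows = [
    ("u1", "abc.com", 300.0), ("u1", "abc.com", 368.0),  # u1 averages 334
    ("u2", "abc.com", 334.0),
    ("u3", "def.com", 33.0),
    ("u4", "def.com", 5000.0),  # average above 3000, filtered out
    ("u5", "def.com", 0.0),     # average of 0, filtered out
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableA (colA TEXT, colB TEXT, colC REAL)")
conn.executemany("INSERT INTO tableA VALUES (?, ?, ?)", rows)

# Query from the question: range filter applied in the outer query.
original = """
SELECT COUNT(X.colA), X.colB, X.MEASURE
FROM (SELECT colA, colB, avg(colC) MEASURE
      FROM tableA
      GROUP BY colA, colB
      HAVING round(avg(colC), 2) > 0
     ) X
GROUP BY X.MEASURE, X.colB
HAVING X.MEASURE BETWEEN 0 AND 3000
ORDER BY MEASURE
"""

# Rewritten query: range filter moved into the subquery.
pushed_down = """
SELECT COUNT(X.colA), X.colB, X.MEASURE
FROM (SELECT colA, colB, avg(colC) MEASURE
      FROM tableA
      GROUP BY colA, colB
      HAVING round(avg(colC), 2) > 0 AND avg(colC) <= 3000
     ) X
GROUP BY X.MEASURE, X.colB
ORDER BY MEASURE
"""

original_rows = conn.execute(original).fetchall()
pushed_down_rows = conn.execute(pushed_down).fetchall()
print(original_rows)
print(pushed_down_rows)
```

On this sample both queries return one row per (average, URL) pair, with u1 and u2 counted together under abc.com because they share the same average.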
You want to aggregate by the average of a set of rows. This seems to require two operations -- one to calculate the average and the other for the final aggregation.
Actually, if I think about it, the following might do what you want:
select colB, count(distinct colA), sum(colC) / count(distinct colA) as measure
from tableA
group by colB
having sum(colC) / count(distinct colA) between 0 and 3000
order by measure;
It is not exactly the same, but I don't understand the purpose of grouping by measure in the outer query. Perhaps a summary with one row for each colB value would be sufficient.
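To make "not exactly the same" concrete, here is a rough sketch of the single-level query on a small in-memory SQLite table. It assumes the intent is one row per colB value, so it groups by colB. The sample rows and the choice of SQLite are assumptions for illustration only.

```python
import sqlite3

# Hypothetical toy data: (colA = user, colB = URL, colC = time spent).
rows = [
    ("u1", "abc.com", 300.0), ("u1", "abc.com", 368.0),
    ("u2", "abc.com", 334.0),
    ("u3", "def.com", 33.0),
    ("u4", "def.com", 5000.0),
    ("u5", "def.com", 0.0),
]

conn = sqlite3.connect(":memory:")
# colC is REAL so that the division below is not integer division in SQLite.
conn.execute("CREATE TABLE tableA (colA TEXT, colB TEXT, colC REAL)")
conn.executemany("INSERT INTO tableA VALUES (?, ?, ?)", rows)

# Single aggregation, assuming one output row per colB value.
single_pass = """
SELECT colB, count(DISTINCT colA), sum(colC) / count(DISTINCT colA) AS measure
FROM tableA
GROUP BY colB
HAVING sum(colC) / count(DISTINCT colA) BETWEEN 0 AND 3000
ORDER BY measure
"""

single_pass_rows = conn.execute(single_pass).fetchall()
for url, users, measure in single_pass_rows:
    print(url, users, measure)
```

Here measure is the total colC per distinct colA within each colB, and the filter applies to that per-colB figure rather than to each colA's own average. On this sample, def.com counts all three users, including the ones whose individual averages are 0 and 5000, which the nested query would have excluded.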
Upvotes: 1