Reputation: 1985
Assume my table looks like this:
cust_id, domain, year, mon, day
1, google.au, 2018, 10, 1
2, virgin.com.au, 2018, 10, 1
3, hotmail.au, 2018, 10, 1
4, yahoo.au, 2018, 10, 1
1, foobar.au, 2018, 10, 1
3, foobar.com.au, 2018, 10, 1
15, haha.com, 2018, 10, 1
11, hehe.net, 2018, 10, 1
I need to group by year/mon/day and aggregate columns based on different conditions:
1) count of distinct domains ending with .au but not .com.au
2) count of distinct domains ending with .com.au
3) count of distinct domains where cust_id is in a specific list, let's assume (1, 2, 3, 4)
4) count of all distinct domains
So my output would look like:
2018, 10, 1, 4, 2, 6, 8
I'm leaning towards using subqueries for each condition and then joining them:
select condition_1.year, condition_1.mon, condition_1.day, condition_1.c1, condition_2.c2, condition_3.c3, condition_4.c4
from
(select year, mon, day, count(distinct domain) c1 from mytable where year = 2018 and mon = 10 and day = 1
and domain rlike '[.]au' and domain not rlike '[.]com[.]au'
group by year, mon, day) condition_1
full outer join
(select count(distinct domain) c2 from mytable where year = 2018 and mon = 10 and day = 1
and domain rlike '[.]com[.]au') condition_2
full outer join
(select count(distinct domain) c3 from mytable where year = 2018 and mon = 10 and day = 1
and cust_id in (1, 2, 3, 4)) condition_3
full outer join
(select count(distinct domain) c4 from mytable where year = 2018 and mon = 10 and day = 1) condition_4
This seems horribly inefficient, though I can't think of a better way. The CASE statement would not work here as I need distinct counts. How could I achieve this more efficiently?
Upvotes: 1
Views: 142
Reputation: 38290
Use collect_set(): it collects a distinct set of values and ignores NULLs. Then apply the size function to get the number of elements, which are already distinct:
select
year, mon, day,
size(condition_1) as condition_1_cnt,
size(condition_2) as condition_2_cnt,
size(condition_3) as condition_3_cnt,
size(condition_4) as condition_4_cnt
from
(
select year, mon, day,
collect_set(case when domain rlike '(?<![.]com)[.]au$' then domain end) condition_1,
collect_set(case when domain rlike '[.]com[.]au$' then domain end) condition_2,
collect_set(case when cust_id in (1, 2, 3, 4) then domain end) condition_3,
collect_set(domain) condition_4
from mytable
where year = 2018 and mon = 10 and day = 1
group by year, mon, day
)s;
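To sanity-check the logic outside Hive, here is a minimal Python sketch that mimics what collect_set plus size does for each condition: build one set per condition, skip rows that fail the condition, and take the set sizes. The names ROWS and distinct_counts are made up for illustration, the rows are the sample data from the question, and Python's re module stands in for Hive's rlike (both search for the pattern anywhere in the string, hence the $ anchor).

```python
import re
from collections import defaultdict

# Sample rows from the question: (cust_id, domain, year, mon, day)
ROWS = [
    (1, "google.au", 2018, 10, 1),
    (2, "virgin.com.au", 2018, 10, 1),
    (3, "hotmail.au", 2018, 10, 1),
    (4, "yahoo.au", 2018, 10, 1),
    (1, "foobar.au", 2018, 10, 1),
    (3, "foobar.com.au", 2018, 10, 1),
    (15, "haha.com", 2018, 10, 1),
    (11, "hehe.net", 2018, 10, 1),
]

AU_NOT_COM_AU = re.compile(r"(?<![.]com)[.]au$")  # ends with .au, not .com.au
COM_AU = re.compile(r"[.]com[.]au$")               # ends with .com.au


def distinct_counts(rows, cust_ids=(1, 2, 3, 4)):
    """Return {(year, mon, day): (c1, c2, c3, c4)} of distinct-domain counts."""
    # One set per condition and per group, like four collect_set() columns.
    sets = defaultdict(lambda: [set(), set(), set(), set()])
    for cust_id, domain, year, mon, day in rows:
        group = sets[(year, mon, day)]
        if AU_NOT_COM_AU.search(domain):
            group[0].add(domain)
        if COM_AU.search(domain):
            group[1].add(domain)
        if cust_id in cust_ids:
            group[2].add(domain)
        group[3].add(domain)
    # size(collect_set(...)) corresponds to len(set) here.
    return {key: tuple(len(s) for s in group) for key, group in sets.items()}


print(distinct_counts(ROWS))  # {(2018, 10, 1): (4, 2, 6, 8)}
```

The result matches the expected output in the question (4, 2, 6, 8), and the table is scanned only once instead of once per condition.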
Upvotes: 0
Reputation: 49260
This can be accomplished with regular expressions and with conditional aggregation.
select year,mon,day
,count(distinct case when domain regexp '(?<!\.com)\.au$' then domain end) as ends_with_au
,count(distinct case when domain regexp '\.com\.au$' then domain end) as ends_with_com_au
,count(distinct case when cust_id in (1,2,3,4) then domain end) as specific_cust
,count(distinct domain) as all_domains
from mytable
group by year,mon,day
The regexp (?<!\.com)\.au$ uses a negative lookbehind assertion to check that the characters preceding .au are not .com. The $ metacharacter anchors the match to the end of the string, so .au must be the last 3 characters. The . has to be escaped with \ because an unescaped . matches any character.
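If you want to try the conditional aggregation without a Hive cluster, the following sketch runs the same query shape against an in-memory SQLite database from Python. This is only a stand-in: SQLite has no built-in regexp, so a small regexp function backed by Python's re module is registered here (my assumption, not part of Hive), and backslash handling in Hive string literals may differ from SQLite's, so treat it as a check of the query logic only.

```python
import re
import sqlite3


def regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" by calling regexp(Y, X).
    return value is not None and re.search(pattern, value) is not None


conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)
conn.execute(
    "create table mytable (cust_id int, domain text, year int, mon int, day int)"
)
conn.executemany(
    "insert into mytable values (?, ?, ?, ?, ?)",
    [
        (1, "google.au", 2018, 10, 1),
        (2, "virgin.com.au", 2018, 10, 1),
        (3, "hotmail.au", 2018, 10, 1),
        (4, "yahoo.au", 2018, 10, 1),
        (1, "foobar.au", 2018, 10, 1),
        (3, "foobar.com.au", 2018, 10, 1),
        (15, "haha.com", 2018, 10, 1),
        (11, "hehe.net", 2018, 10, 1),
    ],
)

# CASE yields NULL for rows that fail the condition, and
# COUNT(DISTINCT ...) ignores NULLs, so each column counts
# only the distinct domains that satisfy its own condition.
QUERY = r"""
select year, mon, day,
       count(distinct case when domain regexp '(?<!\.com)\.au$' then domain end) as ends_with_au,
       count(distinct case when domain regexp '\.com\.au$' then domain end) as ends_with_com_au,
       count(distinct case when cust_id in (1, 2, 3, 4) then domain end) as specific_cust,
       count(distinct domain) as all_domains
from mytable
group by year, mon, day
"""

result = conn.execute(QUERY).fetchall()
print(result)  # [(2018, 10, 1, 4, 2, 6, 8)]
```

On the sample data this gives the expected 4, 2, 6, 8, which shows that a CASE expression inside count(distinct ...) does work for distinct counts.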
Upvotes: 1