Reputation: 1760
I've got a Sqlite database with close to 500,000 rows worth of access log information in it. I'm using it for aggregate information like "number of times each ip has hit the site", or "percentage of hits were POST", etc.
I wrote a SQL query that gathers how many times each IP address has hit the site, where the number of occurrences is greater than 1% of the count of the IP addresses.
select ip_address, count(ip_address)
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
This returns about 7 significant IP addresses. How would I go about unioning an "All Others" row to the result set?
I tried UNIONing with the logical opposite
select "All Others", count(ip_address)
from records
group by ip_address
having count(ip_address) < (select count(ip_address) from records) * .01
but this returns multiple "All Other" rows, with a count that is sequential.
Upvotes: 1
Views: 106
Reputation: 17925
Without CTEs this might be best (I'm not sure what sqlite allows). Using not in
prevents you from having to write the reverse of your condition which in other cases could more complicated by nulls or floating-point math considerations:
select ip_address, count(ip_address)
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
union all
select 'All others', count(*)
from records
where ip_address not in (
select ip_address /* assuming non-null ip_address */
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
)
Otherwise:
with topPercent as (
select ip_address, count(ip_address) as addr_cnt
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
)
select ip_address, addr_cnt from topPercent
union all
select 'All others', count(distinct ip_address) - (select count(*) from topPercent)
If analytic functions were available a third option might be fastest:
select case when pct > 0.01 then ip_address else 'All others' end, sum(addr_cnt)
from (
select ip_address, addr_cnt, addr_cnt * 1.0e / sum(addr_cnt) over () as pct
from (
select ip_address, count(ip_address) as addr_cnt
from records
group by ip_address
) T1
) T2
group by case when pct > 0.01 then ip_address else 'All others' end
Upvotes: 0
Reputation: 61925
Use union all
, of course.. but that doesn't answer "the problem".
This issue is the the second query "returns multiple" (just like the first query) because the group by
is by IP, of which there are many. That is, there is a resulting tuple per group, independent of any operation in the select output clause.
The desired goal is probably to sum-the-counts with an outer select.
-- union all
select "All Others", sum(t.ct)
from (
select count(ip_address) as ct
from records
group by ip_address
-- note: <=, and not <, is inverse of >
having count(ip_address) <= (select count(ip_address) from records) * .01
) t
Of course if the 'total' and 'found' are known then the 'others' is 'total' - 'found'.
The count being sequential, while an interesting observation, is irrelevant. Remember that SQL can return rows in whatever order it feels like when there is no order by
applied to the materialized result-set (order by
in sub-selects are not strictly guaranteed).
Upvotes: 1
Reputation: 1780
Can you use a variable to hold this information?
DECLARE @num INT
SET @num = (select count(*)
from records
group by ip_address
having count(*) > (select count(ip_address) from records) * .01)
Then do your regular query
select ip_address, count(ip_address)
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
UNION
select "All Others", count(ip_address)-@num
from records
Upvotes: 1