Goldentoa11
Goldentoa11

Reputation: 1760

SQL Union "All Other" Row

I've got a Sqlite database with close to 500,000 rows worth of access log information in it. I'm using it for aggregate information like "number of times each ip has hit the site", or "percentage of hits were POST", etc.

I wrote a SQL query that gathers how many times each IP address has hit the site, where the number of occurrences is greater than 1% of the count of the IP addresses.

select ip_address, count(ip_address)
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01

This returns about 7 significant IP addresses. How would I go about unioning an "All Others" row to the result set?

I tried UNIONing with the logical opposite

select "All Others", count(ip_address)
from records
group by ip_address
having count(ip_address) < (select count(ip_address) from records) * .01

but this returns multiple "All Other" rows, with a count that is sequential.

Upvotes: 1

Views: 106

Answers (3)

shawnt00
shawnt00

Reputation: 17925

Without CTEs this might be best (I'm not sure what sqlite allows). Using not in prevents you from having to write the reverse of your condition which in other cases could more complicated by nulls or floating-point math considerations:

select ip_address, count(ip_address)
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
union all
select 'All others', count(*)
from records
where ip_address not in (
    select ip_address /* assuming non-null ip_address */
    from records
    group by ip_address
    having count(ip_address) > (select count(ip_address) from records) * .01
)

Otherwise:

with topPercent as (
    select ip_address, count(ip_address) as addr_cnt
    from records
    group by ip_address
    having count(ip_address) > (select count(ip_address) from records) * .01
)
select ip_address, addr_cnt from topPercent
union all
select 'All others', count(distinct ip_address) - (select count(*) from topPercent)

If analytic functions were available a third option might be fastest:

select case when pct > 0.01 then ip_address else 'All others' end, sum(addr_cnt)
from (
    select ip_address, addr_cnt, addr_cnt * 1.0e / sum(addr_cnt) over () as pct
    from (
        select ip_address, count(ip_address) as addr_cnt
        from records
        group by ip_address
    ) T1
) T2
group by case when pct > 0.01 then ip_address else 'All others' end

Upvotes: 0

user2864740
user2864740

Reputation: 61925

Use union all, of course.. but that doesn't answer "the problem".

This issue is the the second query "returns multiple" (just like the first query) because the group by is by IP, of which there are many. That is, there is a resulting tuple per group, independent of any operation in the select output clause.

The desired goal is probably to sum-the-counts with an outer select.

-- union all
select "All Others", sum(t.ct)
from (
   select count(ip_address) as ct
   from records
   group by ip_address
   -- note: <=, and not <, is inverse of >
   having count(ip_address) <= (select count(ip_address) from records) * .01
   ) t

Of course if the 'total' and 'found' are known then the 'others' is 'total' - 'found'.

The count being sequential, while an interesting observation, is irrelevant. Remember that SQL can return rows in whatever order it feels like when there is no order by applied to the materialized result-set (order by in sub-selects are not strictly guaranteed).

Upvotes: 1

JustAPup
JustAPup

Reputation: 1780

Can you use a variable to hold this information?

DECLARE @num INT
SET @num = (select count(*)
             from records
             group by ip_address
             having count(*) > (select count(ip_address) from records) * .01)

Then do your regular query

select ip_address, count(ip_address)
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
UNION
select "All Others", count(ip_address)-@num
from records      

Upvotes: 1

Related Questions