Pablo
Pablo

Reputation: 140

Find Overlapping within categories in SQL Big Query

I'm trying to do something similar to this but in BigQuery. I have several users that may have 1 or more categories. I must find the overlapping within categories. Something like this:

enter image description here

What I want as result is something like this:

enter image description here

That is, for example, only one user has only category D (and no other), two users have categories 10 and 30, and so on.

The main problem is that I have a lot of categories (over 40). Previously I had done something like:

SELECT sum(cat1), sum(cat2), sum(cat3)
FROM  table
where cat1 = 0 and cat2 = 1 and cat3 = 0

That way worked, but is too manual and impossible to do it right now because I have lots of categories. Want to use BigQuery if it is possible.

Upvotes: 0

Views: 436

Answers (3)

Rajat
Rajat

Reputation: 5803

This isn't exactly what you're looking for but you can use this output as a source to an Excel Pivot or BI tool to get exactly what you want. Performing pivots on 40+ columns in SQL is doable but not fun.

select a.categories, 
       b.categories as cross_categories, 
       count(distinct a.usr) as counts
from t a
join t b on a.usr=b.usr and a.categories<> b.categories
group by a.categories, b.categories

union all

select max(categories), 
       max(categories), 
       count(distinct categories)
from t
group by usr
having count(distinct categories)=1
order by 1,2

DEMO

Excel Pivot

enter image description here

Upvotes: 0

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173046

The main problem is that I have a lot of categories (over 40).

Consider below (BigQuery) approach - works for any reasonable amount of categories

execute immediate (
select '''
  select * from (
    select distinct t1.usr, 
      t1.categories category, t2.categories category2
    from `your_table` t1 left join `your_table` t2 
    on t1.usr = t2.usr and t1.categories != t2.categories
    union all
    select usr, any_value(categories) category, any_value(categories) category2
    from `your_table`
    group by usr
    having count(1) = 1
  )
  pivot (count(usr) cat for category2 in (''' || list || '''))
  order by category
'''
from (
  select string_agg("'" || categories || "'" order by categories) list 
  from (select distinct categories from `your_table`)
  )
)     

if applied to sample data in your question - output is

enter image description here

Upvotes: 1

Sergey Geron
Sergey Geron

Reputation: 10212

FWIW:

with mytable as (
    select 'D' as Usr, '10' as Categories union all 
    select 'E', '10' union all
    select 'E', '30' union all
    select 'F', '30' union all
    select 'G', '10' union all
    select 'G', '50' union all
    select 'H', '10' union all
    select 'H', '30'
)
select grp, count(*) as cnt
from (
    select Usr, string_agg(Categories order by Categories) as grp
    from mytable
    group by Usr
)
group by grp

enter image description here

Upvotes: 0

Related Questions