Reputation: 59
** This question has been asked before here: BigQuery Calculating Percent Overlap Between Values Multiple Columns I am trying to redo a very similar thing, but the asnwers given do not end up working for me ***
I have a table that looks like the following:
category_name | item_id
---------------|------------
category1 | item1
category2 | item1
category3 | item1
category1 | item2
category4 | item2
category1 | item3
category5 | item3
category5 | item2
category6 | item4
category3 | item5
category3 | item6
category1 | item6
category2 | item5
category1 | item4
The first thing that I was trying to achieve was to create a 6x6 table that shows the count of overlap between each pair, which I did successfully using the following code:
select t.category_name,
countif( t2.category_name = 'category1' ) as category1,
countif( t2.category_name = 'category2' ) as category2,
countif( t2.category_name = 'category3' ) as category3,
countif( t2.category_name = 'category4' ) as category4,
countif( t2.category_name = 'category5' ) as category5
from t join
t t2
on t.item = t2.item
group by t.category_name;
However, I also want to calculate the percentage of overlap instead of count. Therefore, I tried to add the "/count(*) at the end of each line in the select clause to get the percentage of overlpa like the following:
select t.category_name,
countif( t2.category_name = 'category1' ) / count(*) as category1,
countif( t2.category_name = 'category2' ) / count(*) as category2,
countif( t2.category_name = 'category3' ) / count(*) as category3,
countif( t2.category_name = 'category4' ) / count(*) as category4,
countif( t2.category_name = 'category5' ) / count(*) as category5
from t join
t t2
on t.item = t2.item
group by t.category_name;
The problem that has occurred is that for the rows and columns with the same category_name, the percentage of overlap does not end up being 100%. For example, for category 1 vs category 1, the percentage ends up to be less than 50%. Could you tell me if you seen anything wrong with my code?
Here is the table that I am looking for:
category_name | category1 | category2 | category3 | category4 | category5 | category6
--------------------------------------------------------------------------------------------------------
category1 | 100% | 20% | 40% | 20% | 40% | 20%
category2 | 50% | 100% | 100% | 0% | 0% | 0%
category3 | 66.67% | 66.67% | 100% | 0% | 0% | 0%
category4 | 100% | 0% | 0% | 100% | 100% | 0%
category5 | 100% | 0% | 0% | 50% | 100% | 0%
category6 | 100% | 0% | 0% | 0% | 0% | 100%
Upvotes: 2
Views: 796
Reputation: 173046
Consider below approach
with temp as (
select category_name, array_agg(item) items
from your_table
group by category_name
)
select * from (
select t1.category_name, t2.category_name category,
round(100 * (select count(*) from t1.items item join t2.items item using(item)) / array_length(t1.items), 2) overlap
from temp t1, temp t2
)
pivot (min(overlap) for category in ('category1', 'category2', 'category3', 'category4', 'category5', 'category6'))
If applied to sample data in your question - output is
Upvotes: 1