user16228179
user16228179

Reputation: 59

SQL: calculate the percentage overlap

** This question has been asked before here: BigQuery Calculating Percent Overlap Between Values Multiple Columns I am trying to redo a very similar thing, but the asnwers given do not end up working for me ***

I have a table that looks like the following:

category_name  |   item_id
---------------|------------
category1      |  item1
category2      |  item1
category3      |  item1
category1      |  item2
category4      |  item2
category1      |  item3
category5      |  item3
category5      |  item2
category6      |  item4
category3      |  item5
category3      |  item6
category1      |  item6
category2      |  item5
category1      |  item4

The first thing that I was trying to achieve was to create a 6x6 table that shows the count of overlap between each pair, which I did successfully using the following code:

select t.category_name,
       countif( t2.category_name = 'category1' ) as category1,
       countif( t2.category_name = 'category2' )  as category2,
       countif( t2.category_name = 'category3' )  as category3,
       countif( t2.category_name = 'category4' )  as category4,
       countif( t2.category_name = 'category5' ) as category5
from t join
     t t2
     on t.item = t2.item
group by t.category_name;

However, I also want to calculate the percentage of overlap instead of count. Therefore, I tried to add the "/count(*) at the end of each line in the select clause to get the percentage of overlpa like the following:

select t.category_name,
       countif( t2.category_name = 'category1' ) / count(*) as category1,
       countif( t2.category_name = 'category2' ) / count(*) as category2,
       countif( t2.category_name = 'category3' ) / count(*) as category3,
       countif( t2.category_name = 'category4' ) / count(*) as category4,
       countif( t2.category_name = 'category5' ) / count(*) as category5
from t join
     t t2
     on t.item = t2.item
group by t.category_name;

The problem that has occurred is that for the rows and columns with the same category_name, the percentage of overlap does not end up being 100%. For example, for category 1 vs category 1, the percentage ends up to be less than 50%. Could you tell me if you seen anything wrong with my code?

Here is the table that I am looking for:

category_name  | category1   |  category2  |   category3   |   category4   |  category5  |  category6
--------------------------------------------------------------------------------------------------------
category1      |   100%      |     20%     |      40%      |      20%      |     40%     |     20%
category2      |    50%      |     100%    |      100%     |       0%      |      0%     |     0%
category3      |    66.67%   |     66.67%  |      100%     |       0%      |      0%     |     0%
category4      |   100%      |      0%     |       0%      |      100%     |     100%    |     0%
category5      |   100%      |      0%     |       0%      |       50%     |     100%    |     0%
category6      |   100%      |      0%     |       0%      |       0%      |      0%     |    100%

Upvotes: 2

Views: 796

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173046

Consider below approach

with temp as (
  select category_name, array_agg(item) items
  from your_table
  group by category_name
) 
select * from (
  select t1.category_name, t2.category_name category,
    round(100 * (select count(*) from t1.items item join t2.items item using(item)) / array_length(t1.items), 2) overlap
  from temp t1, temp t2
)
pivot (min(overlap) for category in ('category1', 'category2', 'category3', 'category4', 'category5', 'category6'))          

If applied to sample data in your question - output is

enter image description here

Upvotes: 1

Related Questions