Sebastian
Sebastian

Reputation: 11

Count and percentage of same strings within two groups with join

I have three tables in Google Bigquery:

t1) ID1, ID2

t2) ID1, Keywords (500.000 rows)

t3) ID2, Keywords (3 million rows)

The observations of ID1 have been matched/linked with observations in ID2, each observation has a number of keywords.

I want to know about the overlap in keywords between the matched ID1's and ID2's.

t1 ┌─────────────┐ │ ID1 │ ID2 │ ├──────┼──────┤ │ 1 │ A │ │ 1 │ B │ │ 1 │ C │ │ 1 │ D │ │ 2 │ E │ │ 2 │ F │ │ 2 │ G │ │ 2 │ H │ │ 3 │ I │ │ 3 │ J │ │ 3 │ K │ │ 3 │ L │ │ 4 │ M │ │ 4 │ N │ │ 4 │ O │ │ 4 │ P │

t2 ┌──────────────────────┐ │ TABLE 2 │ ├──────────────────────┤ │ ID1KEYWORD │ │ 1 │ KEYWORD 1 │ │ 1 │ KEYWORD 2 │ │ 1 │ KEYWORD 3 │ │ 1 │ KEYWORD 4 │ │ 2 │ KEYWORD 2 │ │ 2 │ KEYWORD 3 │ │ 2 │ KEYWORD 6 │ │ 2 │ KEYWORD 8 │ │ 3 │ KEYWORD 10 │ │ 3 │ KEYWORD 64 │ │ 3 │ KEYWORD 42 │ │ 3 │ KEYWORD 39 │ │ 4 │ KEYWORD 18 │ │ 4 │ KEYWORD 33 │ │ 4 │ KEYWORD 52 │ │ 4 │ KEYWORD 24 │ └─────────┴────────────┘

t3 ┌───────────────────────┐ │ TABLE 3 │ ├───────────────────────┤ │ ID2KEYWORD │ │ A │ KEYWORD 1 │ │ A │ KEYWORD 2 │ │ A │ KEYWORD 54 │ │ A │ KEYWORD 34 │ │ B │ KEYWORD 32 │ │ B │ KEYWORD 876 │ │ B │ KEYWORD 632 │ │ B │ KEYWORD 2 │ │ K │ KEYWORD 53 │ │ K │ KEYWORD 43 │ │ K │ KEYWORD 10 │ │ K │ KEYWORD 64 │ │ P │ KEYWORD 56 │ │ P │ KEYWORD 44 │ │ P │ KEYWORD 322 │ │ P │ KEYWORD 99 │ └─────────┴─────────────┘

As the tables show, ID1 (1) is matched to ID2 (A). Both ID1 and ID2 have a KEYWORD 1 and KEYWORD 2, so there's a total of 2 keywords that overlap between both matched observations, which in this case (as ID1 (A) has 4 keywords total) is 50% overlap. I am looking to make the following table, where every row in t1 gets additional columns MATCH COUNT and MATCH PERCENTAGE.

┌───────────────────────────────────────────────┐ │ RESULT │ ├───────────────────────────────────────────────┤ │ ID │ ID2 │ MATCH COUNT │ MATCH PERCENTAGE │ │ 1 │ A │ 2 │ 50% │ │ 1 │ B │ 1 │ 25% │ │(...) │(...)│ (...) │ (...) │ │ 3 │ K │ 2 │ 50% │ │ 4 │ P │ 0 │ 0% │ └────────┴─────┴─────────────┴──────────────────┘

I know it is good etiquette to show what I've already done, but honestly this one is way over my head and I don't even know where to start. I am hoping that somebody can get me into the right direction.

Upvotes: 0

Views: 226

Answers (3)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173106

Below is for BigQuery Standard SQL

#standardSQL
SELECT t1.id1, t1.id2, 
  COUNTIF(t2.keyword = t3.keyword) match_count,
  COUNTIF(t2.keyword = t3.keyword) / COUNT(DISTINCT t2.keyword) match_percentage
FROM t2 CROSS JOIN t3
JOIN t1 ON t1.id1 = t2.id1 AND t1.id2 = t3.id2
GROUP BY t1.id1, t1.id2
-- ORDER BY t1.id1, t1.id2  

with result as below

Row id1 id2 match_count match_percentage     
1   1   A   2           0.5  
2   1   B   1           0.25     
3   3   K   2           0.5  
4   4   P   0           0.0 

Upvotes: 0

jurden
jurden

Reputation: 91

I think it is solution:

select Id1, Id2, Sum(Match) Match, Sum(Match) / Sum(Total) as Perc
from (
    select t2.Id1, t2.Id2, Decode(t1.Keyword, t3.Keyword, 1, 0) Match, 1 Total
    from t2
         inner join t1 on (t2.Id1 = t1.Id1)
         inner join t3 on (t2.Id2 = t3.Id2)
)
group by Id1, Id2

if you don't have function Decode you can use case:

case when t1.Keyword = t3.Keyword then 1 else 0 end 

Easier:

select t1.Id1, t1.Id2, Sum(case when t2.Keyword = t3.Keyword then 1 else 0 end) Match, Sum(case when t2.Keyword = t3.Keyword then 1 else 0 end) / Count(1) Perc
from t2
     inner join t1 on (t2.Id1 = t1.Id1)
    inner join t3 on (t1.Id2 = t3.Id2)
group by t1.Id1, t1.Id2

Google have function CountIf, you can use also:

select t1.Id1, t1.Id2, CountIf(t2.Keyword = t3.Keyword) Match, CountIf(t2.Keyword = t3.Keyword) / Count(1) Perc
from t2
     inner join t1 on (t2.Id1 = t1.Id1)
    inner join t3 on (t1.Id2 = t3.Id2)
group by t1.Id1, t1.Id2

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1270713

You can do this using join and group by:

select t1.id1, t2.id2
       count(t3.keyword) as num_matches,
       count(t3.keyword) / count(*) as proportion_matches
from t1 left join
     t2
     on t1.id1 = t2.id1 left join
     t3
     on t1.id2 = t3.id2 and
        t2.keyword = t3.keyword
group by t1.id1, t2.id2;

This assumes that the keywords are unique for each id.

Upvotes: 2

Related Questions