Reputation: 11
I have three tables in Google Bigquery:
t1) ID1, ID2
t2) ID1, Keywords (500.000 rows)
t3) ID2, Keywords (3 million rows)
The observations of ID1 have been matched/linked with observations in ID2, each observation has a number of keywords.
I want to know about the overlap in keywords between the matched ID1's and ID2's.
t1
┌─────────────┐
│ ID1 │ ID2 │
├──────┼──────┤
│ 1 │ A │
│ 1 │ B │
│ 1 │ C │
│ 1 │ D │
│ 2 │ E │
│ 2 │ F │
│ 2 │ G │
│ 2 │ H │
│ 3 │ I │
│ 3 │ J │
│ 3 │ K │
│ 3 │ L │
│ 4 │ M │
│ 4 │ N │
│ 4 │ O │
│ 4 │ P │
t2
┌──────────────────────┐
│ TABLE 2 │
├──────────────────────┤
│ ID1 │ KEYWORD │
│ 1 │ KEYWORD 1 │
│ 1 │ KEYWORD 2 │
│ 1 │ KEYWORD 3 │
│ 1 │ KEYWORD 4 │
│ 2 │ KEYWORD 2 │
│ 2 │ KEYWORD 3 │
│ 2 │ KEYWORD 6 │
│ 2 │ KEYWORD 8 │
│ 3 │ KEYWORD 10 │
│ 3 │ KEYWORD 64 │
│ 3 │ KEYWORD 42 │
│ 3 │ KEYWORD 39 │
│ 4 │ KEYWORD 18 │
│ 4 │ KEYWORD 33 │
│ 4 │ KEYWORD 52 │
│ 4 │ KEYWORD 24 │
└─────────┴────────────┘
t3
┌───────────────────────┐
│ TABLE 3 │
├───────────────────────┤
│ ID2 │ KEYWORD │
│ A │ KEYWORD 1 │
│ A │ KEYWORD 2 │
│ A │ KEYWORD 54 │
│ A │ KEYWORD 34 │
│ B │ KEYWORD 32 │
│ B │ KEYWORD 876 │
│ B │ KEYWORD 632 │
│ B │ KEYWORD 2 │
│ K │ KEYWORD 53 │
│ K │ KEYWORD 43 │
│ K │ KEYWORD 10 │
│ K │ KEYWORD 64 │
│ P │ KEYWORD 56 │
│ P │ KEYWORD 44 │
│ P │ KEYWORD 322 │
│ P │ KEYWORD 99 │
└─────────┴─────────────┘
As the tables show, ID1 (1) is matched to ID2 (A). Both ID1 and ID2 have a KEYWORD 1 and KEYWORD 2, so there's a total of 2 keywords that overlap between both matched observations, which in this case (as ID1 (A) has 4 keywords total) is 50% overlap. I am looking to make the following table, where every row in t1 gets additional columns MATCH COUNT and MATCH PERCENTAGE.
┌───────────────────────────────────────────────┐
│ RESULT │
├───────────────────────────────────────────────┤
│ ID │ ID2 │ MATCH COUNT │ MATCH PERCENTAGE │
│ 1 │ A │ 2 │ 50% │
│ 1 │ B │ 1 │ 25% │
│(...) │(...)│ (...) │ (...) │
│ 3 │ K │ 2 │ 50% │
│ 4 │ P │ 0 │ 0% │
└────────┴─────┴─────────────┴──────────────────┘
I know it is good etiquette to show what I've already done, but honestly this one is way over my head and I don't even know where to start. I am hoping that somebody can get me into the right direction.
Upvotes: 0
Views: 226
Reputation: 173106
Below is for BigQuery Standard SQL
#standardSQL
SELECT t1.id1, t1.id2,
COUNTIF(t2.keyword = t3.keyword) match_count,
COUNTIF(t2.keyword = t3.keyword) / COUNT(DISTINCT t2.keyword) match_percentage
FROM t2 CROSS JOIN t3
JOIN t1 ON t1.id1 = t2.id1 AND t1.id2 = t3.id2
GROUP BY t1.id1, t1.id2
-- ORDER BY t1.id1, t1.id2
with result as below
Row id1 id2 match_count match_percentage
1 1 A 2 0.5
2 1 B 1 0.25
3 3 K 2 0.5
4 4 P 0 0.0
Upvotes: 0
Reputation: 91
I think it is solution:
select Id1, Id2, Sum(Match) Match, Sum(Match) / Sum(Total) as Perc
from (
select t2.Id1, t2.Id2, Decode(t1.Keyword, t3.Keyword, 1, 0) Match, 1 Total
from t2
inner join t1 on (t2.Id1 = t1.Id1)
inner join t3 on (t2.Id2 = t3.Id2)
)
group by Id1, Id2
if you don't have function Decode you can use case:
case when t1.Keyword = t3.Keyword then 1 else 0 end
Easier:
select t1.Id1, t1.Id2, Sum(case when t2.Keyword = t3.Keyword then 1 else 0 end) Match, Sum(case when t2.Keyword = t3.Keyword then 1 else 0 end) / Count(1) Perc
from t2
inner join t1 on (t2.Id1 = t1.Id1)
inner join t3 on (t1.Id2 = t3.Id2)
group by t1.Id1, t1.Id2
Google have function CountIf, you can use also:
select t1.Id1, t1.Id2, CountIf(t2.Keyword = t3.Keyword) Match, CountIf(t2.Keyword = t3.Keyword) / Count(1) Perc
from t2
inner join t1 on (t2.Id1 = t1.Id1)
inner join t3 on (t1.Id2 = t3.Id2)
group by t1.Id1, t1.Id2
Upvotes: 0
Reputation: 1270713
You can do this using join
and group by
:
select t1.id1, t2.id2
count(t3.keyword) as num_matches,
count(t3.keyword) / count(*) as proportion_matches
from t1 left join
t2
on t1.id1 = t2.id1 left join
t3
on t1.id2 = t3.id2 and
t2.keyword = t3.keyword
group by t1.id1, t2.id2;
This assumes that the keywords are unique for each id.
Upvotes: 2