Chrisvdberge
Chrisvdberge

Reputation: 1966

BigQuery check if string values contain elements of an array or subquery

I'm struggling a bit to reason how to best go about this. I have a table that has 1 column (b) of values I like to check against. The main table I am querying has a column (a) with string values, for which I want to check if they contain elements from the table with column b

Right now I'm only checking for literal matches;

SELECT CASE WHEN lower(table1.a) IN (SELECT lower(table2.b) FROM table2) THEN 1 ELSE 0 END
FROM table1

However the values in column b can contain more words other than an exact match. Now I thought I could use SPLIT to check each word in the string, but not sure if that's the best/only way to go. Especially since it varies per row how many words the string even has.

Something like;

DECLARE match_against as ARRAY;
SET match_against = SELECT lower(table2.b) FROM table2;
SELECT CASE WHEN lower(split(table1.a, ' ')[SAFE_ORDINAL(0)] IN match_against 
OR lower(split(table1.a, ' ')[SAFE_ORDINAL(1)] IN match_against 
OR lower(split(table1.a, ' ')[SAFE_ORDINAL(2)] IN match_against 
THEN 1 ELSE 0 END 
FROM table1;

Especially if column a in table2 is containing strings with multiple words this becomes unworkable. (check for any match between a word from the string in column a and any word in the strings from column b)

any suggestions on how to achieve this?

Some example data: table1 column a values: 'abc', 'aBc dEf', 'abc def ghij' table2 column b values: 'def'

Expected result is 0,1,1

Upvotes: 0

Views: 4267

Answers (1)

Samuel
Samuel

Reputation: 3538

A left join is a good way to comnbine two tables. The CONTAINS_SUBSTR does not accept a column as seach parameter, therefore, we use a UDF:

create temp function contains_str(a string,b string ) returns bool language js as 
"""
return a.includes(b);
""";

with tbl_a as (select a from unnest([ 'abc', 'aBc dEf', 'abc def ghij',"noabc,"cats","cat"])a),
tbl_b as (select b from unnest([ 'dEf', 'abc',"cat"])b)
select *,
contains_str(a,b)
 from tbl_a
left join tbl_b
on contains_str(concat(" ",lower(a)," "),concat(" ",lower(b)," "))

We add spaces around each search. Thus noabc is not matched by abc and cats not by cat.

Each entry of table A is repeated for each match to table B.

Another way is following. It only give the entries of table A which fullfill a match in table B, but it cannnot return additional columns from table B. The soundex helps to find matches. Please read documentation.

create temp function contains_str(a string,b string ) returns bool language js as 
"""
return a.includes(b);
""";

with tbl_a as (select a from unnest([ 'abc', 'aBc dEf', 'abc def ghij',"cats","cat"])a),
tbl_b as (select b from unnest([ 'dEf', 'abc',"cat"])b)
select *,
(select sum(if(contains_str(concat(" ",(lower(a))," "),concat(" ",(lower(b))," ")),1,0)) from tbl_b )

 from tbl_a
# where exists (select if(contains_str(concat(" ",lower(a)," "),concat(" ",lower(b)," ")),1,null) from tbl_b )


Upvotes: 1

Related Questions