ramd
ramd

Reputation: 347

Query to check each element in a list against a column in an BigQuery table and select a record if none of the item is present from that list

Struggling to come up with a query as I am fairly new to the BigQuery/SQL world.
So I need to check if none of the items from a given list of keywords(an external source) is present in the table which can be a combination of keywords from this list of array.
Let me try to explain with an example:
Let's say I have an external source of a list of keywords: [foo, bar, foo1 bar, foobar, foo2 bar1, foo1 foobar, foo foo1 foo2, etc…] (each element can have multiple space-separated words)
In one of the big query tables, we have an array type col which is a combination of one or more keywords from the above list and this also is space-separated(ideally it should have been a multi-element list but unfortunately it is just one element space-separated keywords but that’s the data issue I have to live with it now):

column_name
foo
foo1 bar foobar
foobar foo2 bar1
bar foo1
baz foobar
etc

So, from the above example:
“foo” is in the list;
“foo1 bar”, “foobar” is in list;
“foobar”, “foo2 bar1” is in list;
“foobar”, “foo2 bar1” is in list.
“bar” is in the list but “foo” is not or if you want to consider even “bar foo1” is not in the list, I want to select this record
“baz” is not in the list, I want to select this record as well

Can you please help me with how to achieve this?

Upvotes: 1

Views: 3689

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Consider below approach

select *
from (
  select column_name, 
    array(
      select word
      from unnest(split(column_name, ' ')) word
      left join unnest(arr) phrase
      group by word
      having countif(regexp_contains(phrase, word)) = 0
    ) items_not_present_in_lists
  from (
    select *, array (
        select phrase from phrases 
        where regexp_contains(column_name, phrase)
      ) arr
    from your_table
  )
)
where array_length(items_not_present_in_lists) > 0

if applied to sample data in your question - output is

enter image description here

I was using below dummy data for testing/playing with (so you can too)

with your_table as (
  select 'foo' column_name union all 
  select 'foo1 bar foobar' union all 
  select 'foobar foo2 bar1' union all 
  select 'bar foo1' union all 
  select 'baz foobar' union all 
  select 'etc' 
), phrases as (
  select phrase
  from unnest(['foo', 'bar', 'foo1 bar', 'foobar', 'foo2 bar1', 'foo1 foobar', 'foo foo1 foo2', 'etc']) phrase
)

Upvotes: 3

Related Questions