bigquery mapping tables using LIKE with duplicate rows

Question

based on this question: bigquery update table using LIKE returns "UPDATE/MERGE must match at most one source row for each target row" I came up with a follow-up question that might end up in a complete different solution. That's why I posted a new question rather than a comment.

I am referring to the solution posted by @jon-armstrong. After testing it with different data, there is still the issue that it does not work if there are duplicate rows in 'table1'. Of course this problem comes from the 'GROUP BY' statement - and w/o this, the UPDATE query does not work, resulting in the error message stated in my original question. It doesn't work either, if I 'GROUP' every value, or group nothing as suggested here. I also came up with the idea of using 'PARTITION BY', however, I get a syntax error in BigQuery.

There can be duplicates in my 'table1' (Data) and my mapping table 'table2'. So to make it very precise, this is my goal:

Table1 (data table)

textWithFoundItemInIt         | foundItem
-------------------------------------------
hallo Adam                    |  
Bert says hello               | 
Bert says byebye              | 
Want to find "Caesar"bdjehg   |
Want to find "Caesar"bdjehg   |
Want to find "Caesar"again    |
Want to find "Caesar"again and also Bert    | <== It is no problem, if only MAX()=Caesar or MIN()=Bert name is found. 
Want to find "CaesarCaesar"again and again | <== This is no problem, just finding one Caesar is enough

Table2 (mapping table)

mappingItem
------------
Adam
Bert
Caesar
Bert
Caesar
Adam

Expected result

textWithFoundItemInIt         | foundItem
--------------------------------------------
hallo Adam                    |  Adam
Bert says hello               |  Bert
Bert says byebye              |  Bert
Want to find "Caesar"bdjehg   |  Caesar
Want to find "Caesar"bdjehg   |  Caesar
Want to find "Caesar"again    |  Caesar
Want to find "Caesar"again and also Bert    | Caesar [or Bert]
Want to find "CaesarCaesar"again and again | Caesar

It doesn't matter which Adam from Table2 is found and inserted into Table1, they will be the same. So it is even okay if the first Adam will be over written by the second Adam, or if the query just stops to search any further once one Adam is found.

If I execute Jon's 'SELECT' query, it would result in:

textWithFoundItemInIt         | foundItem
--------------------------------------------
hallo Adam                    |  Adam
Bert says hello               |  Bert
Bert says byebye              |  Bert
Want to find "Caesar"bdjehg   |  Caesar
Want to find "Caesar"again    |  Caesar
Want to find "Caesar"again and also Bert    | Caesar (if MAX() chosen)
Want to find "CaesarCaesar"again and again | Caesar

It (correctly) omits the second "Want to find "Caesar"bdjehg", but that's unfortunately not what I need.

If it is easier, it would also be okay that in cases that two names are found in one row

textWithFoundItemInIt         | foundItem
---------------------------------------------
hallo Adam and Bert           |  Adam, Bert 
Bert says hello to Caesar     |  Bert, Caesar

or

textWithFoundItemInIt         | foundItem1      | foundItem2
---------------------------------------------------------------
hallo Adam and Bert           |  Adam           | Bert 
Bert says hello to Caesar     |  Bert           | Caesar

I hope this helps to understand my issue. In easy words: "It's just a mapping with multiple equal rows" ;-)

Thanks a lot :)

Mikhail Berlyant · Accepted Answer

Consider below approach

select textWithFoundItemInIt, 
  regexp_extract(textWithFoundItemInIt, r'(?i)' || mappingItems) foundItem
from table1, (select string_agg(mappingItem, '|') mappingItems from table2)

if applied to sample data in your question - output is

bigquery mapping tables using LIKE with duplicate rows

Answers (2)

Related Questions