Frank
Frank

Reputation: 55

bigquery mapping tables using LIKE with duplicate rows

based on this question: bigquery update table using LIKE returns "UPDATE/MERGE must match at most one source row for each target row" I came up with a follow-up question that might end up in a complete different solution. That's why I posted a new question rather than a comment.

I am referring to the solution posted by @jon-armstrong. After testing it with different data, there is still the issue that it does not work if there are duplicate rows in 'table1'. Of course this problem comes from the 'GROUP BY' statement - and w/o this, the UPDATE query does not work, resulting in the error message stated in my original question. It doesn't work either, if I 'GROUP' every value, or group nothing as suggested here. I also came up with the idea of using 'PARTITION BY', however, I get a syntax error in BigQuery.

There can be duplicates in my 'table1' (Data) and my mapping table 'table2'. So to make it very precise, this is my goal:

Table1 (data table)

textWithFoundItemInIt         | foundItem
-------------------------------------------
hallo Adam                    |  
Bert says hello               | 
Bert says byebye              | 
Want to find "Caesar"bdjehg   |
Want to find "Caesar"bdjehg   |
Want to find "Caesar"again    |
Want to find "Caesar"again and also Bert    | <== It is no problem, if only MAX()=Caesar or MIN()=Bert name is found. 
Want to find "CaesarCaesar"again and again | <== This is no problem, just finding one Caesar is enough 

Table2 (mapping table)

mappingItem
------------
Adam
Bert
Caesar
Bert
Caesar
Adam

Expected result

textWithFoundItemInIt         | foundItem
--------------------------------------------
hallo Adam                    |  Adam
Bert says hello               |  Bert
Bert says byebye              |  Bert
Want to find "Caesar"bdjehg   |  Caesar
Want to find "Caesar"bdjehg   |  Caesar
Want to find "Caesar"again    |  Caesar
Want to find "Caesar"again and also Bert    | Caesar [or Bert]
Want to find "CaesarCaesar"again and again | Caesar

It doesn't matter which Adam from Table2 is found and inserted into Table1, they will be the same. So it is even okay if the first Adam will be over written by the second Adam, or if the query just stops to search any further once one Adam is found.

If I execute Jon's 'SELECT' query, it would result in:

textWithFoundItemInIt         | foundItem
--------------------------------------------
hallo Adam                    |  Adam
Bert says hello               |  Bert
Bert says byebye              |  Bert
Want to find "Caesar"bdjehg   |  Caesar
Want to find "Caesar"again    |  Caesar
Want to find "Caesar"again and also Bert    | Caesar (if MAX() chosen)
Want to find "CaesarCaesar"again and again | Caesar

It (correctly) omits the second "Want to find "Caesar"bdjehg", but that's unfortunately not what I need.

If it is easier, it would also be okay that in cases that two names are found in one row

textWithFoundItemInIt         | foundItem
---------------------------------------------
hallo Adam and Bert           |  Adam, Bert 
Bert says hello to Caesar     |  Bert, Caesar

or

textWithFoundItemInIt         | foundItem1      | foundItem2
---------------------------------------------------------------
hallo Adam and Bert           |  Adam           | Bert 
Bert says hello to Caesar     |  Bert           | Caesar

I hope this helps to understand my issue. In easy words: "It's just a mapping with multiple equal rows" ;-)

Thanks a lot :)

Upvotes: 0

Views: 599

Answers (2)

Frank
Frank

Reputation: 55

The SELECT statement from @Mikhail works very well. But when I put it into the UPDATE statement, I get my well-known error:

"UPDATE/MERGE must match at most one source row for each target row".

The problem occurs, because the SELECT statement correctly returns duplicates. A simple solution to this issue is to SELECT DISTINCT. If done so, there is no error anymore.

If more than one regexp should be found, then this query is helpful:

select textWithFoundItemInIt, 
ARRAY_TO_STRING(regexp_extract_all(textWithFoundItemInIt, r'(?i)' || mappingItems), " --- ") AS foundItem
from table1, (select string_agg(mappingItem, '|') mappingItems from table2) 

I hope my logic using the DISTINCT statement is fail-proof and feasible in all cases. If anyone has any remarks, I am very happy for feedback.

Upvotes: 0

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173046

Consider below approach

select textWithFoundItemInIt, 
  regexp_extract(textWithFoundItemInIt, r'(?i)' || mappingItems) foundItem
from table1, (select string_agg(mappingItem, '|') mappingItems from table2)    

if applied to sample data in your question - output is

enter image description here

Upvotes: 2

Related Questions