MissouPower
MissouPower

Reputation: 95

Remove overlapping substrings within a BigQuery STRING field

I'm trying to find the most efficient way to remove overlapping substrings from a string field value on BigQuery. My use case is the same as Combining multiple regex substitutions but within BigQuery.

If I sum up the post above:

With the following list of substrings: ["quick brown fox", "fox jumps"]

I want:

A quick brown fox jumps over the lazy dog to be replaced by A over the lazy dog.

My thoughts were to come up with a JS UDF that does a similar job than what's mentioned in the post above i.e. to create a mask of the whole string and loop over the substrings to identify which characters to remove... But do you have better ideas?

Thanks for your help

Upvotes: 2

Views: 502

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172974

I couldn't find out how to do this in Standard SQL

Below is for BigQuery Standard SQL and does whole thing in one shot - just one [simple] query

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'A quick brown fox jumps over the lazy dog' text 
), list AS (
  SELECT ['quick brown fox', 'fox jumps'] phrases
)
SELECT text AS original_text, REGEXP_REPLACE(text, STRING_AGG(pattern, '|'), '') processed_text FROM (
 SELECT DISTINCT text, SUBSTR(text, MIN(start),  MAX(finish) - MIN(start) + 1) pattern FROM (
  SELECT *, COUNTIF(flag) OVER(PARTITION BY text ORDER BY start) grp FROM (
   SELECT *, start > LAG(finish) OVER(PARTITION BY text ORDER BY start) flag FROM (
    SELECT *, start + phrase_len - 1 AS finish FROM (
     SELECT *, LENGTH(cut) + 1 + OFFSET * phrase_len + IFNULL(SUM(LENGTH(cut)) OVER(win), 0) start
     FROM `project.dataset.table`, list, 
     UNNEST(phrases) phrase, 
     UNNEST([LENGTH(phrase)]) phrase_len,
     UNNEST(REGEXP_EXTRACT_ALL(text, r'(.+?)' || phrase)) cut WITH OFFSET
     WINDOW win AS (PARTITION BY text, phrase ORDER BY OFFSET ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) 
 )))) GROUP BY text, grp
) GROUP BY text

with output

Row original_text                               processed_text   
1   A quick brown fox jumps over the lazy dog   A over the lazy dog    

I tested above with few more complex / tricky texts and it still worked

Brief explanation:

  1. gather all inclusions of phrases in list and their respective starts and ends
  2. combine overlapping fragments and calculate their respective starts and ends
  3. extract new fragments based on starts and end from above step 2
  4. order DESC them by length and generate regexp expression
  5. finally do REGEXP_REPLACE using regexp generated in above step 4

Above might look messy - but in reality it does all above in one query and in pure SQL

Upvotes: 2

MissouPower
MissouPower

Reputation: 95

Using a custom JS UDF seems to work, but i've seen faster BigQuery..!

    CREATE FUNCTION `myproject.mydataset.keyword_remover_js`(label STRING) RETURNS STRING LANGUAGE js AS """

    var keywords = ["a quick brown fox", "fox jumps"] ;

    var mask = new Array(label.length).fill(1);
    var reg = new RegExp("(" + keywords.join("|") + ")", 'g');
    var found;

    while (found = reg.exec(label)) { 

        for (var i = found.index; i < reg.lastIndex; i++) {
            mask[i] = 0;
            }
        reg.lastIndex = found.index+1;
    }

    var result = []

    for (var i = 0; i < label.length; i++) {
        if (mask[i]) {
            result.push(label[i])
        }
    }

    return result.join('').replace(/ +/g,' ').replace(/^ +| +$/,'')
    """;

Upvotes: 1

Related Questions