alex
alex

Reputation: 1917

BigQuery - remove duplicate substrings from string

I have a string which contains repeated substrings like

The big black dog big black dog is a friendly friendly dog who who lives nearby nearby

How can I remove these in BQ so that the result looks like this

The big black dog is a friendly dog who lives nearby

I tried with regex capturing groups like in this question but no luck so far, it just returns the original phrase

with text as (
  select "The big black dog big black dog is a friendly 
        friendly dog who who lives nearby nearby" as phrase
)

select REGEXP_REPLACE(phrase,r'([ \\w]+)\\1',r'$1') from text

Edit: assuming the repeated word or phrase follows right after the first instance of a word or phrase

i.e. dog dog is considered a repetition of dog, but dog good dog is not

similarly for phrases:

good dog good dog is repetition of good dog

but good dog catch ball good dog is not considered a repetition

Upvotes: 0

Views: 616

Answers (2)

Jaytiger
Jaytiger

Reputation: 12254

In addition to @Mikhail's wonderful SQL-ish approach, you might consider below regexp JS UDF approach.

regular expression explanation

CREATE TEMP FUNCTION dedup_repetition(s STRING) RETURNS STRING LANGUAGE js AS
"""
  const re = /(.{2,}) (?=\\1)/g;
  return s.replace(re, '');
""";

WITH sample_data AS (
  SELECT 'The big black dog big black dog is a friendly friendly dog who who lives nearby nearby' phrase UNION ALL
  SELECT 'good dog good dog' UNION ALL
  SELECT 'good dog good dog good dog' UNION ALL
  SELECT 'good dog good dog good' UNION ALL
  SELECT 'dog dog dog' UNION ALL
  SELECT 'good dog catch ball good dog'
)
SELECT dedup_repetition(phrase) deduped_phrase FROM sample_data;

enter image description here

Upvotes: 2

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Consider below naive approach

with recursive candidates as (
  select id, phrase, offset_a, len, string_agg(word_b, ' ' order by offset_b) seq
  from your_table, unnest([0,1,2,3,4,5]) as len,
  unnest(split(phrase, ' ')) word_a with offset offset_a  
  join unnest(split(phrase, ' ')) word_b with offset offset_b
  on offset_b between offset_a and offset_a + len  
  group by id, phrase, offset_a, len
), dups as (
  select id,phrase, seq, row_number() over(partition by phrase order by array_length(split(seq, ' ')) desc) as pos
  from (
    select distinct t1.id, t1.phrase, t1.offset_a, t1.seq
    from candidates t1 
    join candidates t2
    on t1.seq = t2.seq
    and t1.phrase = t2.phrase
    and t1.offset_a = t2.offset_a + t2.len + 1
  )
), iterations as (
  select id, phrase, 1 as pos, phrase || ' ' as dedupped_phrase from your_table
  union all 
  select i.id, i.phrase, pos + 1, regexp_replace(dedupped_phrase, r'(' || seq || ' )+', '' || seq || ' ')
  from iterations i
  join dups d
  using(pos, phrase, id)
) 
select id, phrase, trim(dedupped_phrase) as dedupped_phrase
from iterations
qualify 1 = row_number() over(partition by phrase order by pos desc)

So, assuming your_table is

enter image description here

the output is

enter image description here

  • Step 1 - CTE named candidates - here you are getting all possible words sequences to then consider in next step. Note: here in unnest([0,1,2,3,4,5]) len you are setting possible length of sequebces (number of words) to consider. you can use unnest(generate_array(0,5)) as len instead if you want to save some typing :o)
  • Step 2 - CTE named dups identifies duplicate sequences obtained in previous step and ranks them from most lengthy (by words count) to least ones
  • Step 3- CTE named iterations actually goes and eliminates dups one by one (for every phrase)
  • Step 4 - finally, the last SELECT retrieves only last iterations for each initial phrase

Upvotes: 2

Related Questions