Reputation: 1917
I have a string which contains repeated substrings like
The big black dog big black dog is a friendly friendly dog who who lives nearby nearby
How can I remove these in BQ so that the result looks like this
The big black dog is a friendly dog who lives nearby
I tried with regex capturing groups like in this question but no luck so far, it just returns the original phrase
with text as (
select "The big black dog big black dog is a friendly
friendly dog who who lives nearby nearby" as phrase
)
select REGEXP_REPLACE(phrase,r'([ \\w]+)\\1',r'$1') from text
Edit: assuming the repeated word or phrase follows right after the first instance of a word or phrase
i.e. dog dog
is considered a repetition of dog
,
but dog good dog
is not
similarly for phrases:
good dog good dog
is repetition of good dog
but good dog catch ball good dog
is not considered a repetition
Upvotes: 0
Views: 616
Reputation: 12254
In addition to @Mikhail's wonderful SQL-ish approach, you might consider below regexp JS UDF approach.
regular expression explanation
CREATE TEMP FUNCTION dedup_repetition(s STRING) RETURNS STRING LANGUAGE js AS
"""
const re = /(.{2,}) (?=\\1)/g;
return s.replace(re, '');
""";
WITH sample_data AS (
SELECT 'The big black dog big black dog is a friendly friendly dog who who lives nearby nearby' phrase UNION ALL
SELECT 'good dog good dog' UNION ALL
SELECT 'good dog good dog good dog' UNION ALL
SELECT 'good dog good dog good' UNION ALL
SELECT 'dog dog dog' UNION ALL
SELECT 'good dog catch ball good dog'
)
SELECT dedup_repetition(phrase) deduped_phrase FROM sample_data;
Upvotes: 2
Reputation: 172993
Consider below naive approach
with recursive candidates as (
select id, phrase, offset_a, len, string_agg(word_b, ' ' order by offset_b) seq
from your_table, unnest([0,1,2,3,4,5]) as len,
unnest(split(phrase, ' ')) word_a with offset offset_a
join unnest(split(phrase, ' ')) word_b with offset offset_b
on offset_b between offset_a and offset_a + len
group by id, phrase, offset_a, len
), dups as (
select id,phrase, seq, row_number() over(partition by phrase order by array_length(split(seq, ' ')) desc) as pos
from (
select distinct t1.id, t1.phrase, t1.offset_a, t1.seq
from candidates t1
join candidates t2
on t1.seq = t2.seq
and t1.phrase = t2.phrase
and t1.offset_a = t2.offset_a + t2.len + 1
)
), iterations as (
select id, phrase, 1 as pos, phrase || ' ' as dedupped_phrase from your_table
union all
select i.id, i.phrase, pos + 1, regexp_replace(dedupped_phrase, r'(' || seq || ' )+', '' || seq || ' ')
from iterations i
join dups d
using(pos, phrase, id)
)
select id, phrase, trim(dedupped_phrase) as dedupped_phrase
from iterations
qualify 1 = row_number() over(partition by phrase order by pos desc)
So, assuming your_table is
the output is
candidates
- here you are getting all possible words sequences to then consider in next step. Note: here in unnest([0,1,2,3,4,5]) len
you are setting possible length of sequebces (number of words) to consider. you can use unnest(generate_array(0,5)) as len
instead if you want to save some typing :o)dups
identifies duplicate sequences obtained in previous step and ranks them from most lengthy (by words count) to least onesiterations
actually goes and eliminates dups one by one (for every phrase)Upvotes: 2