Reputation: 381
What is a regex which finds all instances of all words which appear more than once in a string (not necessarily appearing consecutively)?
For example, in the string:
How much wood could a woodchuck chuck if a woodchuck could chuck wood? A woodchuck would chuck all the wood he could chuck if a woodchuck could chuck wood.
It would find every instance of duplicate word(s); in the sample above, it would find the following words: "wood","could","a","woodchuck","chuck","if"
.
I have scoured the internet for such a regex to no avail. One would think this would be what all of the questions regarding "finding a duplicate using regex" on SO would be talking about, but they all only talk about adjacent words like "the the".
Upvotes: 2
Views: 1156
Reputation: 66883
One way to do the key part with regex is by using a feature that allows code execution inside regex, to build a frequency hash (map, dictionary). That can then be used to select words that repeat.
Doing it all in regex alone, not ever using any language or tool, isn't really possible (except perhaps using recursion if supported). The rest of this post employs basics of a programming language, in the context of the regex feature that allows code execution in regex.
A feature that I think is available to most engines is to run code in order to prepare the replacement string. I use Perl for the following short samples. What is of interest here is done using a side effect, what of course isn't optimal (and usually results in clumsy looking code).
Either run the normal substitution and put back the matched word
$string =~ s/(\w+)/++$freq{$1}; $1/eg;
or use the "non-destructive" substitution, which doesn't change the target (and returns the changed string or the original if the match failed)
my $toss = $string =~ s/(\w+)/++$freq{$1}/egr;
The returned string is unneeded for the task as described. In both cases we run a substitution on each word when this isn't what we actually need.
Then print keys (words) with frequencies larger than 1
foreach my $word (keys %freq) { say $word if $freq{$word} > 1 }
The regex matchs "words" per \w
character class; adjust as suitable for your need.
Altogether, since this is a tricky task for a regex I'd recommend to split the string into words and count off duplicates using your language's features, rather than push the regex. Any language worth its salt can do this rather elegantly and efficiently.
With Perl, another way would be to use the embedded code construct, that allows code to run in the matching part. As far as I know that is not available in other languages, save for some libraries.
One can run code in the matching part like so
my @matches = $string =~ /(\w+)(?{++$freq{$1}})/g;
where the construct (?{code})
will execute embedded code, here building the frequency hash. Using this feature (safely) requires close reading of documentation.
The @matches
above has all words, and is not of interest in the problem as stated; it is used here to put the regex's match operator in the "list context" so that the search continues through the string via the /g
modifier.
Upvotes: 2
Reputation: 385754
You would need the following:
\b\w+\b
(?: (?= .* \b(\1)\b )
| (?<= \b\1\b .* \1 )
)
(Make sure that .
can match any character in the engine you are using. Adjust as needed.)
You haven't specified a regex engine, but I can't think of any that support variable-width lookbehinds.[1] Yet, this is required to achieve what you want.
It's also horribly slow, having O(N^2) time in terms of words.[2]
Ok, someone showed that Variable-Length Lookbehinds: actually possible in Perl/PCRE! They used recursion to step back one character at a time. Have fun with that.
One would normally use two passes, one to find the duplicates, and one to do the "finding".
my %seen;
my @dups = grep ++$seen{$_} == 2, $file =~ /\w+/g;
my $alt = join "|", @dups;
$file =~ s/\b($alt)\b/<$&>/g;
This is O(N) in terms of words.
Technically, starting in Perl 5.30, lookbehinds "can handle variable lengths from 1 to 255 characters as an experimental feature." This is far too small for the OP, who spoke of in terms of GB in a now-deleted comment.
Imagine you have a document with N words, all different.
O( (N-1)+0 + (N-2)+1 + ... + 1+(N-2) + 0+(N-1) )
= O( [ (N-1)+(N-2)+...+1+0 ] + [ 0+1+...+(N-2)+(N-1) ] )
= O( [ 0+1+...+(N-2)+(N-1) ] * 2 )
= O( 1+...+(N-2)+(N-1) ) # Constant factors irrelevant in O()
= O( (N-1) * ((N-1)+1) / 2 ) # 1+2+..x == x*(x+1)/2
= O( (N-1) * N / 2 )
= O( (N-1) * N ) # Constant factors irrelevant in O()
= O( N^2 - N )
= O( N^2 ) # An N term is subsumed by a N^2 term
Upvotes: 2
Reputation: 6798
Perhaps following demo code is easy to understand
use strict;
use warnings;
use feature 'say';
my $text = do { local $/, <DATA> };
my(%count,@found,$regex);
$count{$_}++ for split '[ .,?]', $text;
$count{$_} > 1 && push @found, $_ for keys %count;
$regex = join('|',@found);
say 'Before: ' . $text;
$text =~ s/\b($regex)\b/<$1>/g;
say 'After: ' . $text;
__DATA__
How much wood could a woodchuck chuck if a woodchuck could chuck wood? A woodchuck would chuck all the wood he could chuck if a woodchuck could chuck wood.
Output
Before: How much wood could a woodchuck chuck if a woodchuck could chuck wood? A woodchuck would chuck all the wood he could chuck if a woodchuck could chuck wood.
After: How much <wood> <could> <a> <woodchuck> <chuck> <if> <a> <woodchuck> <could> <chuck> <wood>? A <woodchuck> would <chuck> all the <wood> he <could> <chuck> <if> <a> <woodchuck> <could> <chuck> <wood>.
Upvotes: 1