Reputation: 157
This is fairly easy if you just want to find if a sentence contains 3 words. But, if you want to make sure it contains 3 words without repeating, and the three words in order, that’s is what I’m having a hard time finding. So a sample sentence:
This is a sample sentence to find a sample which will work for finding words
So in the above sentence, if we want to find if the sentence contains the 3 words:
That is straightforward. Something like:
sample.*?sentence.*?work
But if we want to find if they are in the exact order without repeating, that’s what I’m trying to do but it’s tricky. As you can see in the sample sentence, the word “sample” is repeated before the word “work”, so that line would not be valid. Using a lookahead I was able to see something that almost works, but the kicker is that my engine cannot use lookaheads or lookforwards.
Is this possible to do without a lookahead?
Upvotes: 1
Views: 135
Reputation: 13631
Yes, it is possible, but the regex is long, and tedious and error-prone to write.
/\bsample\b
(?:[^sw]|\B[sw]|[sw]\b|
s(?:[^ae]|[ae]\b|
a(?:[^m]|m\b|
m(?:[^p]|p\b|
p(?:[^l]|l\b|
l(?:[^e]|e\B))))|
e(?:[^n]|n\b|
n(?:[^t]|t\b|
t(?:[^e]|e\b|
e(?:[^n]|n\b|
n(?:[^c]|c\b|
c(?:[^e]|e\B))))))
)|
w(?:[^o]|o\b|
o(?:[^r]|r\b|
r(?:[^k]|k\B))
)
)*
\bsentence\b
(?:[^sw]|\B[sw]|[sw]\b|
s(?:[^ae]|[ae]\b|
a(?:[^m]|m\b|
m(?:[^p]|p\b|
p(?:[^l]|l\b|
l(?:[^e]|e\B))))|
e(?:[^n]|n\b|
n(?:[^t]|t\b|
t(?:[^e]|e\b|
e(?:[^n]|n\b|
n(?:[^c]|c\b|
c(?:[^e]|e\B))))))
)|
w(?:[^o]|o\b|
o(?:[^r]|r\b|
r(?:[^k]|k\B))
)
)*
\bwork\b/
The regex will not match part of a sentence with the words 'sample', 'sentence', or 'work' in that order, if the words 'sample', 'sentence', or 'work' appear in between the words 'sample' and 'sentence' or in between the words 'sentence' and 'work'.
Heavy use is made of the word boundary zero-width assertion \b
and its opposite \B
.
Perl example:
my $s = 'This is a sample sentence which will work for finding words'; # Should match
# my $s = 'This is a sample sentence to find a sample which will work for finding words'; # Shouldn't match
my $re = '\bsample\b(?:[^sw]|\B[sw]|[sw]\b|s(?:[^ae]|[ae]\b|a(?:[^m]|m\b|m(?:[^p]|p\b|p(?:[^l]|l\b|l(?:[^e]|e\B))))|e(?:[^n]|n\b|n(?:[^t]|t\b|t(?:[^e]|e\b|e(?:[^n]|n\b|n(?:[^c]|c\b|c(?:[^e]|e\B)))))))|w(?:[^o]|o\b|o(?:[^r]|r\b|r(?:[^k]|k\B))))*\bsentence\b(?:[^sw]|\B[sw]|[sw]\b|s(?:[^ae]|[ae]\b|a(?:[^m]|m\b|m(?:[^p]|p\b|p(?:[^l]|l\b|l(?:[^e]|e\B))))|e(?:[^n]|n\b|n(?:[^t]|t\b|t(?:[^e]|e\b|e(?:[^n]|n\b|n(?:[^c]|c\b|c(?:[^e]|e\B)))))))|w(?:[^o]|o\b|o(?:[^r]|r\b|r(?:[^k]|k\B))))*\bwork\b';
if ($s =~ $re) {
print "$&";
}
JavaScript example:
const sentences = [
// Should match:
'This is a sample sentence which will work for finding words',
'This is a sampled sample or sampled sentence which works and will work for finding words',
'This work is a sentence sample and it is a sentence which will work as a sample sentence',
// Shouldn't match:
'This is a sample sentence to find a sample which will work for finding words',
'This is a sample to work to find a sentence which will work for finding words',
'This is a sentence to find a work sample which will work for finding words',
'This is a sentence to find which sample will work for finding words',
'This is a sample sentence to find a sample which will work for finding words',
'This is a sentence to find a sample which will work for finding words',
];
const re = /\bsample\b(?:[^sw]|\B[sw]|[sw]\b|s(?:[^ae]|[ae]\b|a(?:[^m]|m\b|m(?:[^p]|p\b|p(?:[^l]|l\b|l(?:[^e]|e\B))))|e(?:[^n]|n\b|n(?:[^t]|t\b|t(?:[^e]|e\b|e(?:[^n]|n\b|n(?:[^c]|c\b|c(?:[^e]|e\B)))))))|w(?:[^o]|o\b|o(?:[^r]|r\b|r(?:[^k]|k\B))))*\bsentence\b(?:[^sw]|\B[sw]|[sw]\b|s(?:[^ae]|[ae]\b|a(?:[^m]|m\b|m(?:[^p]|p\b|p(?:[^l]|l\b|l(?:[^e]|e\B))))|e(?:[^n]|n\b|n(?:[^t]|t\b|t(?:[^e]|e\b|e(?:[^n]|n\b|n(?:[^c]|c\b|c(?:[^e]|e\B)))))))|w(?:[^o]|o\b|o(?:[^r]|r\b|r(?:[^k]|k\B))))*\bwork\b/;
for (const s of sentences) {
const m = s.match(re);
if (m != null) console.log(m[0]);
}
Upvotes: 1
Reputation: 15030
No, you can't do it without lookahead. Think about it.
I'll give you analytic/algorithmic help so you can figure it out the regex solution:
Write the regex that matches the first word.
Then append to it a regex that has a negative lookahead
to prevent matching the first word at every position as it scans for the second word.
Then append another regex that has a negative lookahead
to prevent matching the first and second words at every position as it scans for the third word.
This is totally doable. But as you can see if you can see a sequence of four works gets means yet a yet longer regex and so forth.
It would be easier with a loop construct in procedural code, especially if you need to generalize for N words.
You can’t take advantage of a Regex that supports recursion, because successive calls are not the same.
Upvotes: 0