Reputation: 395
I am facing a problem with regex usage. I am using the following regex:
\\S*the[^o\\s]*(?<!theo)\\b
The sentence that I am using is:
If the world says that theo is not oreo cookies then thetatheoder theotatheder thetatheder is extratheaterly good.
What I want from the output is the following matches: the, then, thetatheder, extratheaterly.
In short, I want to match 'the' (or 'The') as a whole word or as a substring of a word, provided that the word does not contain 'theo'.
How can I modify my regex to achieve this? I was thinking of using alternation (the pipe operator) or the question mark, but neither seems feasible.
Upvotes: 1
Views: 111
Reputation: 163362
You might use \S in a negative lookbehind as a start boundary, and a negative lookahead to make sure the word does not contain theo.
To match The or the you could make the pattern case insensitive.
(?<!\S)(?!\S*theo\S*)\S*the\S*
In parts:
(?<!\S) Negative lookbehind: assert that what is on the left is not a non-whitespace char
(?!\S*theo\S*) Negative lookahead: assert that what is on the right does not contain theo
\S*the\S* Match the, surrounded by 0+ non-whitespace chars on either side
If you are only using word characters, you could also make use of word boundaries (\b):
\b(?!\w*theo\w*)\w*the\w*\b
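As a quick check, here is a minimal Python sketch that runs the two patterns above against the sentence from the question. The variable names and the hyphenated sample token are my own, made up for illustration. It also shows where the two versions can differ: \S treats a hyphenated token as one word, while \b and \w split it at the hyphen.

```python
import re

# The sentence from the question.
sentence = (
    "If the world says that theo is not oreo cookies then "
    "thetatheoder theotatheder thetatheder is extratheaterly good."
)

# Whitespace-delimited version: a word is any run of non-whitespace chars.
by_whitespace = re.compile(r"(?<!\S)(?!\S*theo\S*)\S*the\S*", re.IGNORECASE)

# Word-boundary version: a word is a run of \w chars.
by_word_boundary = re.compile(r"\b(?!\w*theo\w*)\w*the\w*\b", re.IGNORECASE)

whitespace_matches = by_whitespace.findall(sentence)
boundary_matches = by_word_boundary.findall(sentence)
print(whitespace_matches)  # ['the', 'then', 'thetatheder', 'extratheaterly']
print(boundary_matches)    # ['the', 'then', 'thetatheder', 'extratheaterly']

# Hypothetical hyphenated token (not from the question) where the two differ.
hyphenated = "the-theo"
hyphen_whitespace = by_whitespace.findall(hyphenated)
hyphen_boundary = by_word_boundary.findall(hyphenated)
print(hyphen_whitespace)  # [] : the whole token contains theo
print(hyphen_boundary)    # ['the'] : the hyphen splits it into two words
```

Which one you want depends on whether punctuation such as a hyphen should count as part of the word.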
Or you might assert that a part of the word is the, and match it using an assertion that whenever you match a t, it is not followed by heo:
\b(?=\S*the\S*)[^t\s]*(?:t(?!heo)[^t\s]*)+\b
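A small Python sketch of this last pattern against the sentence from the question (the variable names are mine, for illustration only). The pattern is case sensitive as written, so you would need to add a case-insensitive flag to also match The.

```python
import re

# The sentence from the question.
sentence = (
    "If the world says that theo is not oreo cookies then "
    "thetatheoder theotatheder thetatheder is extratheaterly good."
)

# The lookahead requires "the" somewhere in the word; the body then walks
# the word and refuses any "t" that is directly followed by "heo".
tempered = re.compile(r"\b(?=\S*the\S*)[^t\s]*(?:t(?!heo)[^t\s]*)+\b")

tempered_matches = tempered.findall(sentence)
print(tempered_matches)  # ['the', 'then', 'thetatheder', 'extratheaterly']
```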
Upvotes: 1
Reputation: 141
\b[A-Za-z]*he([a-z](?<!theo))*\b
matches the, then, extratheaterly
\b word boundary
[A-Za-z] matches any letter
[a-z] matches any lowercase letter
* matches 0 or more
([a-z](?<!theo))*
This is the tricky part. It says: match any lowercase letter, then look behind to make sure that adding that letter did not spell theo.
Have a look at negative lookbehinds and negative lookaheads.
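For anyone who wants to try this pattern, here is a minimal Python sketch against the sentence from the question (variable names are mine). I use finditer with group(0) because the pattern contains a capturing group, so findall would return the group contents rather than the whole match. One caveat, as far as I can tell: the lookbehind only guards the letters that come after the he, so a theo that appears earlier in the word can be consumed by the leading [A-Za-z]* and theotatheder is accepted. Since the pattern looks for he rather than the, it should also accept other words that merely contain he.

```python
import re

# The sentence from the question.
sentence = (
    "If the world says that theo is not oreo cookies then "
    "thetatheoder theotatheder thetatheder is extratheaterly good."
)

lookbehind_pattern = re.compile(r"\b[A-Za-z]*he([a-z](?<!theo))*\b")

# group(0) is the full match; the capturing group only holds the last letter.
lookbehind_matches = [m.group(0) for m in lookbehind_pattern.finditer(sentence)]
print(lookbehind_matches)
# ['the', 'then', 'theotatheder', 'thetatheder', 'extratheaterly']
```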
Upvotes: 1
Reputation: 27723
If you want to design a general expression, maybe you can start with some expression similar to,
\S*the[^o\s]*\b
depending on what you'd like to match and not match.
You might also find word boundaries (\b) helpful here, with a simple expression similar to,
\b[Tt]he\b|\b[Tt]hen\b|\bextratheaterly\b
Or,
\b(?:[Tt]hen?|[Ee]xtratheaterly)\b
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpression {
    public static void main(String[] args) {
        final String regex = "\\b(?:[Tt]hen?|[Ee]xtratheaterly)\\b";
        final String string = "If the world says that theo is not oreo cookies then thetatheoder is extratheaterly good.\n\n"
            + "If The world says that theo is not oreo cookies Then thetatheoder is Extratheaterly good.\n\n"
            + "If notthe world says that theo is not oreo cookies notthen thetatheoder is notextratheaterly good.\n\n\n";
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            // The expression has no capturing groups, so this loop prints nothing here.
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}
Full match: the
Full match: then
Full match: extratheaterly
Full match: The
Full match: Then
Full match: Extratheaterly
import re
string = '''
If the world says that theo is not oreo cookies then thetatheoder is extratheaterly good.
If The world says that theo is not oreo cookies Then thetatheoder is Extratheaterly good.
If notthe world says that theo is not oreo cookies notthen thetatheoder is notextratheaterly good.
'''
expression = r'\b(?:[Tt]hen?|[Ee]xtratheaterly)\b'
print(re.findall(expression, string))
print([m.group(0) for m in re.finditer(expression, string)])
['the', 'then', 'extratheaterly', 'The', 'Then', 'Extratheaterly']
['the', 'then', 'extratheaterly', 'The', 'Then', 'Extratheaterly']
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against some sample inputs.
Upvotes: 1