akshit bhatia

Reputation: 395

using regex to find substring

I am facing a problem with regex usage. I am using the following regex:

\\S*the[^o\\s]*(?<!theo)\\b

The sentence that I am using is:

If the world says that theo is not oreo cookies then thetatheoder theotatheder thetatheder is extratheaterly good.

The output I want is the following matches: the, then, thetatheder, extratheaterly.

In short, I want to match 'the' (or 'The') either as a whole word or as a substring of any word that does not contain 'theo'.

How can I modify my regex to achieve this? I thought about using the pipe (|) operator or a question mark (?), but neither seems to work.

Upvotes: 1

Views: 111

Answers (3)

The fourth bird

Reputation: 163362

You might use \S in a negative lookbehind as a start boundary, together with a negative lookahead to make sure the word does not contain theo.

To match The or the you could make the pattern case insensitive.

(?<!\S)(?!\S*theo\S*)\S*the\S*

In parts

  • (?<!\S) Negative lookbehind, assert that what is directly on the left is not a non-whitespace char
  • (?!\S*theo\S*) Negative lookahead, assert that what is on the right does not contain theo
  • \S*the\S* Match the, surrounded by 0+ non-whitespace chars on either side
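A quick way to check the pattern above against the sentence from the question is a short Python script. This is only a sketch: the variable names are mine, and re.IGNORECASE is an assumption added to cover the The/the case mentioned earlier.

```python
import re

# Pattern copied from the answer; IGNORECASE (an assumption) also accepts "The".
PATTERN = re.compile(r"(?<!\S)(?!\S*theo\S*)\S*the\S*", re.IGNORECASE)

# Sentence from the question.
sentence = (
    "If the world says that theo is not oreo cookies then thetatheoder "
    "theotatheder thetatheder is extratheaterly good."
)

matches = PATTERN.findall(sentence)
print(matches)
# ['the', 'then', 'thetatheder', 'extratheaterly']
```

Because the lookahead scans the whole whitespace-delimited token, a token is rejected no matter where theo appears in it.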

Regex demo

If you are only using word characters, you could also make use of word boundaries \b

\b(?!\w*theo\w*)\w*the\w*\b
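The \S version and the \b version only differ once punctuation is involved. The sketch below compares them on a made-up sample string (the sample and the variable names are my own, not from the question).

```python
import re

# Both patterns are copied from this answer.
WHITESPACE_BOUNDED = re.compile(r"(?<!\S)(?!\S*theo\S*)\S*the\S*")
WORD_BOUNDED = re.compile(r"\b(?!\w*theo\w*)\w*the\w*\b")

# Hypothetical sample with a hyphenated token and trailing punctuation.
sample = "theo-then went to the theater, then left."

ws_matches = WHITESPACE_BOUNDED.findall(sample)
word_matches = WORD_BOUNDED.findall(sample)

print(ws_matches)    # ['the', 'theater,', 'then']
print(word_matches)  # ['then', 'the', 'theater', 'then']
```

The \S version treats theo-then as one token and rejects it entirely, and it keeps the trailing comma on theater. The \b version treats the hyphen as a boundary, so the then after the hyphen is matched on its own.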

Regex demo

Or you might assert with a lookahead that the word contains the, and then match the word so that every t is checked with a negative lookahead to make sure it is not followed by heo

\b(?=\S*the\S*)[^t\s]*(?:t(?!heo)[^t\s]*)+\b
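The same check can be run for this last variant. Again this is just a sketch with my own variable names, using the sentence from the question.

```python
import re

# Pattern copied from the answer: every "t" is guarded by (?!heo).
PATTERN = re.compile(r"\b(?=\S*the\S*)[^t\s]*(?:t(?!heo)[^t\s]*)+\b")

sentence = (
    "If the world says that theo is not oreo cookies then thetatheoder "
    "theotatheder thetatheder is extratheaterly good."
)

# The only group is non-capturing, so findall returns the full matches.
matches = PATTERN.findall(sentence)
print(matches)
# ['the', 'then', 'thetatheder', 'extratheaterly']
```

Words such as that and not contain a t but no the, so the leading lookahead rejects them.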

Regex demo

Upvotes: 1

Jerald Macachor

Reputation: 141

\b[A-Za-z]*he([a-z](?<!theo))*\b

matches the, then, extratheaterly

\b word boundary

[A-Za-z] matches any letter

[a-z] matches any lowercase letter

* matches 0 or more

([a-z](?<!theo))*

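This is the tricky part. It says: match any lowercase letter, then look behind to make sure the last four characters do not spell theo after adding that letter. Note that this check only applies to the letters matched after he, not to the [A-Za-z]* prefix.

Read up on negative lookbehinds and negative lookaheads.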
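To see what this pattern does on the full sentence from the question, here is a small Python sketch (variable names are mine). It uses finditer rather than findall because the pattern contains a capturing group, and findall would return the group contents instead of the whole match.

```python
import re

# Pattern copied from this answer.
PATTERN = re.compile(r"\b[A-Za-z]*he([a-z](?<!theo))*\b")

sentence = (
    "If the world says that theo is not oreo cookies then thetatheoder "
    "theotatheder thetatheder is extratheaterly good."
)

matches = [m.group(0) for m in PATTERN.finditer(sentence)]
print(matches)
# ['the', 'then', 'theotatheder', 'thetatheder', 'extratheaterly']
```

Two things stand out. theotatheder is matched even though it contains theo, because the theo sits inside the unchecked [A-Za-z]* prefix. And since the literal part is he rather than the, a word like she would match as well.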

Upvotes: 1

Emma

Reputation: 27723

Generic

If you want to design a general expression, maybe you can start with some expression similar to,

\S*the[^o\s]*\b

depending on what you'd like to match and not match, I guess.
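As a rough check of what this generic expression does and does not match, it can be run against the sentence from the question (a sketch only; variable names are mine).

```python
import re

# Generic pattern from above.
PATTERN = re.compile(r"\S*the[^o\s]*\b")

sentence = (
    "If the world says that theo is not oreo cookies then thetatheoder "
    "theotatheder thetatheder is extratheaterly good."
)

matches = PATTERN.findall(sentence)
print(matches)
# ['the', 'then', 'theotatheder', 'thetatheder', 'extratheaterly']
```

It rejects theo and thetatheoder, but it still accepts theotatheder, because the leading \S* can swallow an earlier theo before the final the is matched.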

Demo

Non-Generic

I guess you can simply find word boundaries (\b) helpful to solve your problem, with some simple expression similar to,

\b[Tt]he\b|\b[Tt]hen\b|\bextratheaterly\b

Demo 1

Or,

\b(?:[Tt]hen?|[Ee]xtratheaterly)\b

Demo 2

Java Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpression {

    public static void main(String[] args) {

        // Whole-word match for the/The, then/Then, extratheaterly/Extratheaterly.
        final String regex = "\\b(?:[Tt]hen?|[Ee]xtratheaterly)\\b";

        // Three test sentences: lowercase, capitalized, and with a "not" prefix
        // glued to the words (which must not match).
        final String string = "If the world says that theo is not oreo cookies then thetatheoder is extratheaterly good.\n\n"
             + "If The world says that theo is not oreo cookies Then thetatheoder is Extratheaterly good.\n\n"
             + "If notthe world says that theo is not oreo cookies notthen thetatheoder is notextratheaterly good.\n\n\n";

        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);

        // Print every match; the group loop prints nothing here because the
        // expression has no capturing groups.
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}

Output

Full match: the
Full match: then
Full match: extratheaterly
Full match: The
Full match: Then
Full match: Extratheaterly

Python Test

import re
string = '''
If the world says that theo is not oreo cookies then thetatheoder is extratheaterly good.

If The world says that theo is not oreo cookies Then thetatheoder is Extratheaterly good.

If notthe world says that theo is not oreo cookies notthen thetatheoder is notextratheaterly good.
'''

expression = r'\b(?:[Tt]hen?|[Ee]xtratheaterly)\b'

print(re.findall(expression, string))
print([m.group(0) for m in re.finditer(expression, string)])

Output

['the', 'then', 'extratheaterly', 'The', 'Then', 'Extratheaterly']
['the', 'then', 'extratheaterly', 'The', 'Then', 'Extratheaterly']

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. There you can also see how it matches against some sample inputs.


RegEx Circuit

jex.im visualizes regular expressions.

Upvotes: 1
