Malonge

Reputation: 2040

Regex Multiple Strings With "or" Operator

I need to write a Java regex that will recognize the following 3 cases:

  1. Any combination/amount of the following characters: "ACTGactg:"

or

  2. A single question mark "?"

or

  3. The string "NTC"

I will list what I have tried so far and the errors that have arisen.

public static final String VALID_STRING = "[ACTGactg:]*";
// Matches the first case but not the second or third
// as expected.

public static final String VALID_STRING = "\\?|[ACTGactg:]*";
// Matches all 3 conditions when my understanding leads me to
// believe that it should not accept the third case of "NTC"

public static final String VALID_STRING = "?|[ACTGactg:]*";
// Yields PatternSyntaxException dangling metacharacter ?

What I would expect to be accurate is the following:

public static final String VALID_STRING = "NTC|\\?|[ACTGacgt:]*";

But I want to make sure that if I take away the "NTC" alternative, any "NTC" string is rejected as invalid.

Here is the method I am using to test these regexes.

private static boolean isValid(String thisString){
    boolean valid = false;
    Pattern checkRegex = Pattern.compile(VALID_STRING);
    Matcher matchRegex = checkRegex.matcher(thisString);
    while (matchRegex.find()){
        if (matchRegex.group().length() != 0){
            valid = true;
        }
    }
    return valid;
}

So here are my closing questions:

  1. Could the "\\?" regex possibly be acting as a wildcard character that is accepting the "NTC" string?

  2. Are the or operators "|" appropriate here?

  3. Do I need to make use of parentheses when using these or operators?

Here are some example incoming strings:

Thank you

Upvotes: 0

Views: 1065

Answers (2)

ntrp

Reputation: 401

Yes, the regex you proposed would work (here with + instead of *, so that the empty string is rejected):

public static final VALID_STRING = "NTC|\\?|[ACTGacgt:]+";

...

boolean valid = str.matches(VALID_STRING);

If you remove NTC| from the regex, the string NTC becomes invalid.
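
As a quick check of that claim, here is a minimal sketch that runs a few inputs through the pattern with and without the NTC alternative. The class name ValidStringDemo, the constant names and the helper isValid are made up for this illustration; only String.matches() is standard API.

```java
class ValidStringDemo {
    // Pattern from the answer: "NTC", a single "?", or one or more of ACTGactg:
    static final String WITH_NTC = "NTC|\\?|[ACTGacgt:]+";
    // The same pattern with the "NTC|" alternative removed
    static final String WITHOUT_NTC = "\\?|[ACTGacgt:]+";

    // Hypothetical helper: String.matches() succeeds only if the
    // whole input matches the regex, not just a part of it.
    static boolean isValid(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"A:C", "?", "NTC", "??", ""}) {
            System.out.println("\"" + s + "\""
                    + " with NTC: " + isValid(WITH_NTC, s)
                    + ", without NTC: " + isValid(WITHOUT_NTC, s));
        }
    }
}
```

Because matches() tests the entire string, "NTC" is accepted only while the NTC alternative is present, and "??" and the empty string are rejected by both patterns.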

You can test it and experiment yourself here.

Upvotes: 2

RealSkeptic

Reputation: 34608

Since you are using the Matcher.find() method, you are looking for your pattern anywhere in the string.

This means the strings A:C, T:G, AA:CC etc. match in their entirety. But how about NTC?

It matches because find() looks for a match anywhere. The TC part of it matches, therefore you get true.

If you want to match only the strings in their entirety, either use the matches() method, or use ^ and $.

Note that you don't have to check that the match is longer than 0, if you change your pattern to [ACTGactg:]+ instead of [ACTGactg:]*.
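
To make the difference concrete, here is a small sketch comparing find(), matches(), and an anchored pattern on the same inputs. The class name FindVersusMatches and the three helper methods are invented for this example; Pattern and Matcher are the standard java.util.regex classes.

```java
import java.util.regex.Pattern;

class FindVersusMatches {
    // Pattern without the "NTC" alternative, using + as suggested above
    static final Pattern PLAIN = Pattern.compile("\\?|[ACTGactg:]+");
    // Anchored variant; the non-capturing group makes ^ and $ apply
    // to both alternatives rather than to one each
    static final Pattern ANCHORED = Pattern.compile("^(?:\\?|[ACTGactg:]+)$");

    // find(): true if the pattern matches any substring
    static boolean foundAnywhere(String s) {
        return PLAIN.matcher(s).find();
    }

    // matches(): true only if the pattern matches the whole string
    static boolean matchesWhole(String s) {
        return PLAIN.matcher(s).matches();
    }

    // find() with an anchored pattern behaves like a whole-string check here
    static boolean foundAnchored(String s) {
        return ANCHORED.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"A:C", "?", "NTC"}) {
            System.out.println(s + ": find=" + foundAnywhere(s)
                    + " matches=" + matchesWhole(s)
                    + " anchored find=" + foundAnchored(s));
        }
    }
}
```

For "NTC", find() returns true because of the TC substring, while matches() and the anchored pattern return false. The group in the anchored pattern matters: since | has the lowest precedence, writing ^\\?|[ACTGactg:]+$ would anchor each alternative on one side only, and find() would still accept "NTC".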

Upvotes: 2
