Phil

Reputation: 1927

Java Regex: Replace character unless preceded by other character

I am using Java and regular expressions and need to split some data into multiple entities. In my input, a single quote character (') marks the end of an entity UNLESS it is preceded by the escape character, which is a question mark (?).

My RegEx is (?<!\\?)\\' and I'm using a Scanner to split the input into separate entities. So the following cases work correctly:

Hello'There  becomes 2 entities: Hello and There
Hello?'There remains 1 entity:   Hello?'There

However when I encounter the case where I want to escape the question mark it doesn't work. So:

Hello??'There     should become 2 entities:   Hello?? and There
Hello???'There    should become 1 entity:     Hello???'There
Hello????'There   should become 2 entities:   Hello???? and There
Hello?????'There  should become 1 entity:     Hello?????'There
Hello?????There   should become 1 entity:     Hello?????There
Hello??????There  should become 1 entity:     Hello??????There

Thus the rule is: if a quote is preceded by an even number of question marks (including zero), the input should be split at that quote. If it is preceded by an odd number of question marks, it should not be split.
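To make the failing case concrete, here is a minimal sketch of the setup described above. It assumes the delimiter is installed with Scanner.useDelimiter; the class name LookbehindRepro and the method entities are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class LookbehindRepro {
    // The delimiter from the question: a quote that is not directly preceded by '?'.
    static final String DELIMITER = "(?<!\\?)\\'";

    // Collects every token the Scanner produces for the given input.
    static List<String> entities(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(entities("Hello'There"));   // [Hello, There]
        System.out.println(entities("Hello?'There"));  // [Hello?'There]
        // The lookbehind only inspects one character, so an escaped '?' still blocks the split:
        System.out.println(entities("Hello??'There")); // [Hello??'There]
    }
}
```

The last call is the problem case: the lookbehind cannot tell an escaping question mark from an escaped one.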

Can someone help fix my Regex (hopefully with an explanation!) to cope with the multiple cases?

Thanks,

Phil

Upvotes: 4

Views: 1837

Answers (3)

Alan Moore

Reputation: 75222

Don't use split() for this. That seems like the obvious solution, but it's much easier to match the entities themselves than it is to match the delimiters. Most of the regex-enabled languages have built-in methods for this, like Python's findall() or Ruby's scan(), but in Java we're still stuck with writing boilerplate. Here's an example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("([^?']|\\?.)+");
String[] inputs = {
    "Hello??'There",
    "Hello???'There",
    "Hello????'There",
    "Hello?????'There",
    "Hello?????There",
    "Hello??????There"
};
for (String s : inputs)
{
  System.out.printf("%n%s :%n", s);
  Matcher m = p.matcher(s);
  while (m.find())
  {
    System.out.printf("  %s%n", m.group());
  }
}

output:

Hello??'There :
  Hello??
  There

Hello???'There :
  Hello???'There

Hello????'There :
  Hello????
  There

Hello?????'There :
  Hello?????'There

Hello?????There :
  Hello?????There

Hello??????There :
  Hello??????There

The arbitrary-max-length gimmick Thomas used, besides being a disgusting hack (no offense intended, Thomas!), is unreliable because they keep introducing bugs into the Pattern.java code that handles that stuff. But don't think of this solution as another workaround; lookbehinds should never be your first resort, even in flavors like .NET where they work reliably and restriction-free.
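If you need the entities as a list rather than printed output, the same pattern can be wrapped in a small helper. This is only a sketch: the class name EntityMatcher and the method findEntities are invented for illustration. Because the pattern requires at least one character, empty entities (two delimiters in a row) are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityMatcher {
    // An entity is a run of: any character other than '?' or '\'',
    // or a '?' together with the single character it escapes.
    private static final Pattern ENTITY = Pattern.compile("([^?']|\\?.)+");

    // Returns each matched entity in order of appearance.
    static List<String> findEntities(String input) {
        List<String> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(input);
        while (m.find()) {
            entities.add(m.group());
        }
        return entities;
    }

    public static void main(String[] args) {
        System.out.println(findEntities("Hello??'There"));  // [Hello??, There]
        System.out.println(findEntities("Hello???'There")); // [Hello???'There]
    }
}
```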

Upvotes: 2

Chris B

Reputation: 925

Are you sure you want to use regular expressions? If your string is relatively small and/or execution time isn't a big issue, you could use a StringBuilder and a loop that counts the "?" characters, e.g.:

    import java.util.ArrayList;
    import java.util.List;

    // Your string
    String x = "Hello??'World'Hello?'World";
    StringBuilder sb = new StringBuilder();
    // Holds your splits
    List<String> parts = new ArrayList<String>();

    // Number of consecutive '?' characters directly before the current character
    int questionMarkCount = 0;

    for (char c : x.toCharArray()) {
        if (c == '?') {
            questionMarkCount++;
            sb.append(c);
        } else if (c == '\'' && questionMarkCount % 2 == 0) {
            // Even number of '?' (or none): the quote is a delimiter.
            // Add the current split, clear the StringBuilder and reset the count.
            parts.add(sb.toString());
            sb.setLength(0);
            questionMarkCount = 0;
        } else {
            // Either an escaped quote (odd number of '?') or an ordinary character:
            // append it, no split is needed, and start counting from the beginning.
            sb.append(c);
            questionMarkCount = 0;
        }
    }
    // Add whatever follows the last delimiter
    parts.add(sb.toString());

By the end of the loop, the parts list holds all of your splits. The code splits when an EVEN number of question marks (or none) precedes the '.
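Since only the parity of the question-mark run matters, the counter can also be replaced by a boolean "escaped" flag. The following is a sketch of that variant, not a drop-in from the answer above; the class name EscapeAwareSplitter and the method split are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class EscapeAwareSplitter {
    // Splits on every quote that is not escaped by a preceding '?'.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false; // true when the previous '?' escapes this character

        for (char c : input.toCharArray()) {
            if (escaped) {
                // Escaped character (possibly '?' or '\''): keep it as-is.
                current.append(c);
                escaped = false;
            } else if (c == '?') {
                // Unescaped '?': keep it and mark the next character as escaped.
                current.append(c);
                escaped = true;
            } else if (c == '\'') {
                // Unescaped quote: end of the current entity.
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        // [Hello??, World, Hello?'World]
        System.out.println(split("Hello??'World'Hello?'World"));
    }
}
```

Like the counting loop, this keeps the escape characters in the output and adds an empty final entry when the input ends with a delimiter.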

Upvotes: 0

Thomas

Reputation: 88707

Try this expression to match even cases: (?<=[^\?](?>\?\?){0,1000})'

  • (?<=...)' is a positive lookbehind, i.e. every ' that is preceded by the expression between (?<= and ) will match
  • (?>\?\?) is an atomic group of 2 consecutive question marks
  • (?>\?\?){0,1000} means there can be 0 to 1000 of those groups. Note that you can't write (?>\?\?)* since the expression needs to have a maximum length (a maximum number of groups). However, you should be able to increase the upper bound by a lot, depending on the rest of the expression
  • [^\?](?>\?\?)... means the groups of 2 question marks must be preceded by some character but not a question mark (otherwise you'd match the odd case)
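As a rough illustration, the expression can be passed to String.split once its backslashes are doubled for a Java string literal. The class name LookbehindSplit is made up for this sketch. Note that [^\?] needs an actual character in front of the question marks, so a quote at the very start of the input is not treated as a delimiter.

```java
import java.util.Arrays;

class LookbehindSplit {
    // The expression above as a Java string literal (backslashes doubled).
    static final String DELIMITER = "(?<=[^\\?](?>\\?\\?){0,1000})'";

    public static void main(String[] args) {
        String[] inputs = { "Hello'There", "Hello?'There", "Hello??'There", "Hello???'There" };
        for (String s : inputs) {
            // Prints each input followed by the pieces produced by the split.
            System.out.println(s + " -> " + Arrays.toString(s.split(DELIMITER)));
        }
    }
}
```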

Upvotes: 3
