Reputation: 1927
I am using Java and Regular Expressions and need to split some data into multiple entities. In my input a single quote character (') specifies an end of entity UNLESS it's preceded by the escape character, which is a question mark (?).
My RegEx is (?<!\\?)\\' and I'm using a Scanner to split the input into separate entities. So the following cases work correctly:
Hello'There becomes 2 entities: Hello and There
Hello?'There remains 1 entity: Hello?'There
However, when I want to escape the question mark itself, it doesn't work. So:
Hello??'There should become 2 entities: Hello?? and There
Hello???'There should become 1 entity: Hello???'There
Hello????'There should become 2 entities: Hello???? and There
Hello?????'There should become 1 entity: Hello?????'There
Hello?????There should become 1 entity: Hello?????There
Hello??????There should become 1 entity: Hello??????There
Thus the rule is: if a quote is preceded by an even number of question marks (including none), it should split. If it is preceded by an odd number of question marks, it should not split.
Can someone help fix my Regex (hopefully with an explanation!) to cope with the multiple cases?
Thanks,
Phil
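For reference, the behaviour described above can be reproduced with a short sketch. The class and method names here are made up for illustration; only the delimiter comes from the question. It suggests why the original expression falls short: the negative lookbehind inspects only the single character before the quote, so it cannot tell an escaped question mark from an escaping one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class; only the delimiter is taken from the question.
class SingleLookbehindDemo {
    // A quote that is not directly preceded by a question mark.
    static final String DELIMITER = "(?<!\\?)'";

    // Splits the input with a Scanner, as described in the question.
    static List<String> entities(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(entities("Hello'There"));   // two entities, as wanted
        System.out.println(entities("Hello?'There"));  // one entity, as wanted
        System.out.println(entities("Hello??'There")); // one entity, but two are wanted
    }
}
```

The third case is the problem: the character before the quote is a question mark, so the lookbehind rejects the quote even though that question mark is itself escaped.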
Upvotes: 4
Views: 1837
Reputation: 75222
Don't use split() for this. That seems like the obvious solution, but it's much easier to match the entities themselves than it is to match the delimiters. Most of the regex-enabled languages have built-in methods for this, like Python's findall() or Ruby's scan(), but in Java we're still stuck with writing boilerplate. Here's an example:
// requires java.util.regex.Matcher and java.util.regex.Pattern
// an entity is a run of ordinary characters or escape pairs ('?' plus any character)
Pattern p = Pattern.compile("([^?']|\\?.)+");
String[] inputs = {
    "Hello??'There",
    "Hello???'There",
    "Hello????'There",
    "Hello?????'There",
    "Hello?????There",
    "Hello??????There"
};
for (String s : inputs)
{
    System.out.printf("%n%s :%n", s);
    Matcher m = p.matcher(s);
    while (m.find())
    {
        System.out.printf(" %s%n", m.group());
    }
}
output:
Hello??'There :
Hello??
There
Hello???'There :
Hello???'There
Hello????'There :
Hello????
There
Hello?????'There :
Hello?????'There
Hello?????There :
Hello?????There
Hello??????There :
Hello??????There
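If you need the entities as a list rather than printed, the same loop can be wrapped in a small helper, roughly what findall() or scan() would give you. This is only a sketch: the class name EntityMatcher and the method findAll are invented for illustration, and the pattern is the one used above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not part of any library.
class EntityMatcher {
    // An entity is a run of ordinary characters or escape pairs ('?' plus any character).
    private static final Pattern ENTITY = Pattern.compile("([^?']|\\?.)+");

    // Collects every entity found in the input, in order.
    static List<String> findAll(String input) {
        List<String> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(input);
        while (m.find()) {
            entities.add(m.group());
        }
        return entities;
    }

    public static void main(String[] args) {
        System.out.println(findAll("Hello??'There"));  // [Hello??, There]
        System.out.println(findAll("Hello???'There")); // [Hello???'There]
    }
}
```

Two things to be aware of with this pattern: because of the +, empty entities are skipped (two adjacent quotes produce nothing between them), and a lone ? at the very end of the input has nothing to escape, so it is left out of the last match.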
The arbitrary-max-length gimmick Thomas used, besides being a disgusting hack (no offense intended, Thomas!), is unreliable because they keep introducing bugs into the Pattern.java code that handles that stuff. But don't think of this solution as another workaround; lookbehinds should never be your first resort, even in flavors like .NET where they work reliably and restriction-free.
Upvotes: 2
Reputation: 925
Are you sure you want to use regular expressions? If your string is relatively small and/or execution time isn't a big issue, you could use a StringBuilder and a loop to count the number of "?" directly before each quote, e.g.
// requires java.util.ArrayList
// Your String
String x = "Hello??'World'Hello?'World";
StringBuilder sb = new StringBuilder();
// Holds your splits
ArrayList<String> parts = new ArrayList<String>();
// Number of consecutive '?' directly before the current character
int questionmarkcount = 0;
for (char c : x.toCharArray()) {
    if (c == '?') {
        questionmarkcount++;
        sb.append(c);
    } else if (c == '\'') {
        if (questionmarkcount % 2 == 0) {
            // even number of '?' (or none) before the quote:
            // add the current split and clear the StringBuilder
            parts.add(sb.toString());
            sb.setLength(0);
        } else {
            // odd number of '?': the quote is escaped, so append it; no split is needed
            sb.append(c);
        }
        // start counting from the beginning
        questionmarkcount = 0;
    } else {
        sb.append(c);
        // an ordinary character ends the run of '?', so reset the count
        questionmarkcount = 0;
    }
}
parts.add(sb.toString());
By the end of the loop the parts ArrayList would hold all of your splits. The current code will split if there are an EVEN number of question marks preceding the '.
Upvotes: 0
Reputation: 88707
Try this expression to match the even cases: (?<=[^\?](?>\?\?){0,1000})'
(?<=...)' is a positive lookbehind, i.e. every ' which is preceded by the expression between (?<= and ) will match.
(?>\?\?) is an atomic group of 2 consecutive question marks.
(?>\?\?){0,1000} means there can be 0 to 1000 of those groups. Note that you can't write (?>\?\?)* since the expression needs to have a maximum length (a maximum number of groups). However, you should be able to increase the upper bound by a lot, depending on the rest of the expression.
[^\?](?>\?\?)... means the groups of 2 question marks must be preceded by some character but not a question mark (otherwise you'd match the odd case).
Upvotes: 3
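As a rough sketch of how the lookbehind expression above could be used from Java, here it is as a string literal (backslashes doubled) passed to Pattern.split(). The class and method names are invented for illustration, and split() stands in for the Scanner mentioned in the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical wrapper around the lookbehind expression; names are illustrative.
class LookbehindSplit {
    // A quote preceded by a non-'?' character and then 0 to 1000 pairs of '?'.
    static final Pattern EVEN_ESCAPES_THEN_QUOTE =
            Pattern.compile("(?<=[^\\?](?>\\?\\?){0,1000})'");

    static List<String> split(String input) {
        return Arrays.asList(EVEN_ESCAPES_THEN_QUOTE.split(input));
    }

    public static void main(String[] args) {
        System.out.println(split("Hello??'There"));  // [Hello??, There]
        System.out.println(split("Hello???'There")); // [Hello???'There]
    }
}
```

One caveat: since [^\?] has to consume a character, a quote at the very start of the input (or one preceded only by question marks) is not treated as a delimiter by this expression.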