Reputation: 19895
I am trying to solve a simple Java regex matching problem but still getting conflicting results (following up on this and that question).
More specifically, I am trying to match a repetitive text input, consisting of groups that are delimited by '|' (vertical bar) that may be directly preceded by underscore ('_'), especially if the groups are not empty (i.e., if no two consecutive | delimiters appear in the input).
An example such input is:
Text group 1_|Text group 2_|||Text group 5_|||Text group 8
In addition, I need a way to verify that a match has occurred, in order to avoid applying the processing related to that input to other, totally different inputs that my application also processes, using different regular expressions.
To confirm that a regex works, I am using RegexPal.
After several tests, the closest to what I want are the following two Regular Expressions, suggested in the questions I quoted above:
1. (?:\||^)([^\\|]*)
2. \G([^\|]+?)_?\||\G()\||\G([^\|]*)$
Using either of these, if I run a matcher.find() loop I get:
So, apparently Regex 2 is not correct (and RegexPal also does not show it as matching).
I could use Regex 1 and do some post-processing to remove the trailing underscore, although ideally I would like the regex to do that for me.
However, none of the two aforementioned regular expressions returns true for matcher.matches(), whereas matcher.find() is always true even for totally irrelevant input (reasonable, since there will often be at least 1 matching group, even in other text).
I thus have two questions:
The code used to test Regex 1, is something like
String input = "Text group 1_|Text group 2_|||Text group 5_|||Text group 8";
Matcher matcher = Pattern.compile("(?:\\||^)([^\\\\|]*)").matcher(input);
if (matcher.matches())
{
System.out.println("Input MATCHED: " + input);
while (matcher.find())
{
System.out.println("\t\t" + matcher.group(1));
}
}
else
{
System.out.println("\tInput NOT MATCHED: " + input);
}
Using the above code always results in "NOT MATCHED". Removing the if/else and only using matcher.find() does retrieve all text groups.
Upvotes: 2
Views: 1656
Reputation: 38300
RegexPal is a JavaScript regex tool. The Java and JavaScript regular expression languages differ. Consider using a Java Regex tool; perhaps this one
This may be close to what you want: (?:([^_\|]+)_{0,1}+\|*)+
Edit: Code added. In java 6 this prints each group (the find() loop).
public static void main(String[] args)
{
String input = "Text group 1_|Text group 2_|||Text group 5_|||Text group 8";
Matcher matcher;
Pattern pattern = Pattern.compile("(?:([^_\\|]+)_{0,1}+\\|*)+");
Pattern groupPattern = Pattern.compile("(?:([^_\\|]+)_{0,1}+\\|*)");
matcher = pattern.matcher(input);
if (matcher.matches())
{
Matcher groupMatcher;
System.out.println("matcher.matches() is true");
int groupCount = matcher.groupCount();
for (int index = 1; index <= groupCount; ++index)
{
System.out.print("group (pattern)[");
System.out.print(index);
System.out.print("]: ");
System.out.println(matcher.group(index));
}
groupMatcher = groupPattern.matcher(input);
while (groupMatcher.find())
{
System.out.print("group (groupPattern):");
System.out.println(groupMatcher.group());
System.out.println(groupMatcher.group(1));
}
}
else
{
System.out.println("No match");
}
}
Upvotes: 1
Reputation: 9664
Matcher#matches
method attempts to match the entire input sequence against the pattern, that is why you are getting the result Input NOT MATCHED
. See the documentation here http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#matches
If you want to exclude the trailing underscore you can use this regex (slight modification of what you already have)
(?:\\||^)([^\\\\|_]*)
This would work if you are sure that _
comes just before |
.
Upvotes: 1