Reputation: 7645
I've been using the code below to try to extract the different sections from the text I provided.
It should pick out the digits, and then any sections enclosed in [ square brackets or " quotation marks, into groups. Here is the code.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Launcher2 {
    /**
     * @param args
     */
    public static void main(String[] args) {
        PrintRegexes("100.000[$₮-45]");
    }

    public static void PrintRegexes(String textToMatch) {
        Pattern p = Pattern.compile("(\\[.*?\\]|\".*?\")?.*?(\\d{1,3}(?:,\\d{3})*?(?:\\.\\d+)?).*?(\\[.*?\\]|\".*?\")", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(textToMatch);
        if (m.find()) {
            for (int groups = 0; groups < m.groupCount(); groups++) {
                System.out.println("Group " + groups + " contains " + m.group(groups));
            }
            for (int groups = 0; m.find(groups); groups++) { //this will error, but right now, it's the least of my concerns
                System.out.println("Group " + groups + " contains " + m.group(groups));
            }
        }
    }
}
Group 0 contains 100.000[$₮-45]
Group 1 contains null
Group 2 contains 100.000
Group 3 contains [$₮-45]
Group 0 contains 100.000[$₮-45]
Group 1 contains null
Group 2 contains 0.000
Group 3 contains [$₮-45]
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 4 //don't care about this, I've got bigger strings(fish) to regex(fry) at the moment!
at java.util.regex.Matcher.group(Unknown Source)
at Launcher2.PrintRegexes(Launcher2.java:21)
at Launcher2.main(Launcher2.java:10)
All the groups are the same in both runs except for group 2: one prints out as 0.000 and the other prints out as 100.000.
Why is this?
This behaviour goes away if I put something in front of AND behind the digits.
If I only put something in front, I get this output:
Group 0 contains [$₮-45]100.000
Group 1 contains [$₮-45]
Group 2 contains 100.000
Group 3 contains null
Group 0 contains [$₮-45]100.000
Group 1 contains null
Group 2 contains 45
Group 3 contains null
Even weirder! The strangest part (for me) is that this works on www.debuggex.com.
Am I writing my pattern wrong? Or is it that the matcher doesn't work out the groups when Matcher m = p.matcher(textToMatch); constructs it, and that affects its behaviour?
Upvotes: 1
Views: 331
Reputation: 46239
I can see two problems here.
Firstly, you call m.find() multiple times with the group number as its argument, which doesn't work the way you think it does.
If you look at the JavaDoc for find(int start), you see that it resets the matcher and then restarts the search at the specified index of the input. The argument is a character position, not a group number, which explains the shorter number sequences matched in later iterations.
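To make the effect of find(int start) concrete, here is a minimal sketch. It assumes a simplified pattern that matches only the number part, and the question's input with the ₮ character dropped so the source stays ASCII-only; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindStartDemo {
    // Simplified stand-in for the number part of the question's pattern (an assumption).
    static final Pattern NUMBER = Pattern.compile("\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?");

    // First match found by a normal search over the whole input.
    static String firstMatch(String text) {
        Matcher m = NUMBER.matcher(text);
        return m.find() ? m.group() : null;
    }

    // First match found after resetting the matcher and starting at index `start`.
    static String matchFrom(String text, int start) {
        Matcher m = NUMBER.matcher(text);
        return m.find(start) ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "100.000[$-45]";
        System.out.println(firstMatch(text));   // 100.000
        System.out.println(matchFrom(text, 2)); // 0.000 (search begins at the third character)
    }
}
```

Passing 2 skips the leading "10", which is the same thing that happens in the question when the loop counter reaches 2.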
Secondly, groupCount() does not count group 0 (the whole match), so you need to loop until groups <= m.groupCount() to get all groups:
Pattern p =
    Pattern.compile("(\\[.*?\\]|\".*?\")?.*?(\\d{1,3}(?:,\\d{3})*?(?:\\.\\d+)?).*?(\\[.*?\\]|\".*?\")",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(textToMatch);
if (m.find()) {
    for (int groups = 0; groups <= m.groupCount(); groups++) {
        System.out.println("Group " + groups + " contains " + m.group(groups));
    }
}
prints
Group 0 contains 100.000[$₮-45]
Group 1 contains null
Group 2 contains 100.000
Group 3 contains [$₮-45]
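If the second loop in the question was meant to walk through successive matches (that is an assumption about the intent), the usual idiom is to call the no-argument find() in a while loop. The sketch below uses a simplified number-only pattern and a made-up input containing several numbers; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllMatchesDemo {
    // Simplified number-only pattern (an assumption, not the question's full pattern).
    static final Pattern NUMBER = Pattern.compile("\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?");

    // Collects every match; each find() resumes where the previous match ended.
    static List<String> allNumbers(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = NUMBER.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints 100.000, then 45, then 2,500.75, one per line.
        for (String number : allNumbers("100.000[$-45] 2,500.75")) {
            System.out.println(number);
        }
    }
}
```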
Upvotes: 1
Reputation: 2055
Looks like the problem is with this part: (?:,\\d{3})*?
I think you need ((?:,\\d{3})*)?
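A small sketch of what the lazy quantifier changes. It assumes a greedy, still non-capturing variant (?:,\\d{3})* rather than the exact replacement suggested above, so that the group numbers stay the same; the comma-grouped input and the class name are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyQuantifierDemo {
    // The question's pattern, with the lazy quantifier on the thousands part.
    static final Pattern LAZY = Pattern.compile(
        "(\\[.*?\\]|\".*?\")?.*?(\\d{1,3}(?:,\\d{3})*?(?:\\.\\d+)?).*?(\\[.*?\\]|\".*?\")",
        Pattern.DOTALL);

    // Same pattern, but the thousands part is greedy (assumed variant).
    static final Pattern GREEDY = Pattern.compile(
        "(\\[.*?\\]|\".*?\")?.*?(\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?).*?(\\[.*?\\]|\".*?\")",
        Pattern.DOTALL);

    // Returns capture group 2 (the number) of the first match, or null if nothing matches.
    static String group2(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String text = "1,000.50[x]";
        System.out.println(group2(LAZY, text));   // 1
        System.out.println(group2(GREEDY, text)); // 1,000.50
    }
}
```

With the lazy version, the following .*? is free to swallow ",000.50", so group 2 stops after the first digit run. For the question's own input, which has no thousands separators, both variants capture 100.000, so this affects comma-grouped numbers rather than the 0.000 output.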
Upvotes: 0