Reputation: 373
First time posting.
I already know how to use both Pattern/Matcher and String.split. My question is which of the two is better for my example, and why? Suggestions for better alternatives are also welcome.
Task: I need to extract an unknown NOUN that sits between two known regexps in an unknown string.
My solution: find the start and end of the noun (from regexp 1 and 2) and use substring to extract it.
String line = "unknownXoooXNOUNXccccccXunknown";
int goal = 12 ;
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
A) I can use Pattern and Matcher:
Pattern p = Pattern.compile(regexp1);
Matcher m = p.matcher(line);
if (!m.find()) {
    // TODO: exception management
    throw new IllegalArgumentException();
}
int afterRegex1 = m.end(); // index just past the first delimiter
B) I can use String.split:
String[] split = line.split(regexp1, 2);
if (split.length != 2) {
throw new UnsupportedOperationException();
//TODO Exception Management;
}
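// split[1] is the tail of line, so it starts at length - tail length;
// indexOf(split[1]) could hit an earlier occurrence of the same text.
int afterRegex1 = line.length() - split[1].length();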
Which approach should I use, and why? I don't know which is more efficient in time and memory. Both are about equally readable to me.
Upvotes: 8
Views: 10634
Reputation: 5585
I'd do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regex = "Xo+X(.*?)Xc+X";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(line);
if (m.find()) {
String noun = m.group(1);
}
The (.*?) is used to make the inner match on the NOUN reluctant. This protects us from a case where our ending pattern appears again in the unknown portion of the string.
EDIT
This works because the (.*?) defines a capture group. There's only one such group defined in the pattern, so it gets index 1 (the parameter to m.group(1)). These groups are indexed from left to right starting at 1. If the pattern were defined like this
String regex = "(Xo+X)(.*?)(Xc+X)";
Then there would be three capture groups, such that
m.group(1); // yields "XoooX"
m.group(2); // yields "NOUN"
m.group(3); // yields "XccccccX"
There is also a group 0, which holds the text matched by the whole pattern; m.group(0) is equivalent to this
m.group(); // yields "XoooXNOUNXccccccX"
For more information about what you can do with the Matcher, including ways to get the start and end positions of your pattern within the source string, see the Matcher JavaDocs.
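Since the question asks for positions as well as the noun itself, here is a minimal sketch of how the capture group's offsets might be read from the Matcher via start(int) and end(int). The class name NounSpan and the helper locate are made up for illustration; only the regex and the sample line come from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class NounSpan {
    // Compiled once; group 1 is the reluctant capture between the two delimiters.
    private static final Pattern NOUN = Pattern.compile("Xo+X(.*?)Xc+X");

    /** Returns {start, end} of group 1 within line, or null if nothing matches. */
    static int[] locate(String line) {
        Matcher m = NOUN.matcher(line);
        if (!m.find()) {
            return null;
        }
        // start(1) is inclusive, end(1) is exclusive, matching substring().
        return new int[] { m.start(1), m.end(1) };
    }

    public static void main(String[] args) {
        String line = "unknownXoooXNOUNXccccccXunknown";
        int[] span = locate(line);
        // prints "12..16 -> NOUN"
        System.out.println(span[0] + ".." + span[1] + " -> " + line.substring(span[0], span[1]));
    }
}
```

The start offset of 12 agrees with the goal value in the question, so a single find() yields both the noun and its location.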
Upvotes: 6
Reputation: 298579
If you really need the locations you can do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
Matcher m=Pattern.compile(regexp1).matcher(line);
if(m.find())
{
int start=m.end();
if(m.usePattern(Pattern.compile(regexp2)).find())
{
final int end = m.start();
System.out.println("from "+start+" to "+end+" is "+line.substring(start, end));
}
}
But if you just need the word in between, I recommend the way Ian McLaird has shown.
Upvotes: 0
Reputation: 46492
It looks like you want to extract a single, unique occurrence. For that, simply do
input.replaceAll(".*Xo+X(.*)Xc+X.*", "$1")
For efficiency, compile the Pattern once and call pattern.matcher(input).replaceAll(...) instead.
If your input contains line breaks, use Pattern.DOTALL or the s modifier, written inline as (?s).
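A rough sketch of the precompiled variant with DOTALL enabled, in case it helps. The class name UniqueNoun and the method extract are invented for this example; the regex is the one from this answer. Note that replaceAll returns the input unchanged when nothing matches, so a caller cannot tell "no match" apart from a real result without an extra check.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper, for illustration only.
class UniqueNoun {
    // DOTALL lets '.' match line terminators, so the leading and trailing
    // ".*" can swallow other lines of a multi-line input.
    private static final Pattern WHOLE =
            Pattern.compile(".*Xo+X(.*)Xc+X.*", Pattern.DOTALL);

    /** Replaces the whole input with group 1; returns input as-is if no match. */
    static String extract(String input) {
        return WHOLE.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String input = "first line\nunknownXoooXNOUNXccccccXunknown\nlast line";
        System.out.println(extract(input)); // prints "NOUN"
    }
}
```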
If you want to use split, consider using Guava's Splitter. It behaves more sanely and also accepts a precompiled Pattern, which is good for speed.
Upvotes: 2
Reputation: 1669
You should use String.split() for readability unless you're in a tight loop.
Per the split() javadoc, split() does the equivalent of Pattern.compile() on every call, which you can optimize away by compiling the pattern once if you're in a tight loop.
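As a sketch of what "optimizing away" the compile could look like: the pattern is compiled once into a constant and Pattern.split(CharSequence, int) is reused on each line. The class name SplitInLoop and the method afterStart are hypothetical; the delimiter regex is the asker's regexp1.

```java
import java.util.regex.Pattern;

// Hypothetical example class, not from the original answer.
class SplitInLoop {
    // Compiled once instead of on every String.split() call.
    private static final Pattern START = Pattern.compile("Xo+X");

    /** Returns everything after the first start delimiter in line. */
    static String afterStart(String line) {
        // Limit 2: split at the first match only, keeping the tail intact.
        String[] parts = START.split(line, 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("start delimiter not found");
        }
        return parts[1];
    }

    public static void main(String[] args) {
        String[] lines = {
            "unknownXoooXNOUNXccccccXunknown",
            "aXoXVERBXcXb"
        };
        for (String line : lines) {
            System.out.println(afterStart(line));
        }
    }
}
```

Outside a hot loop the plain String.split() call is shorter and reads just as well, which is the trade-off this answer describes.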
Upvotes: 4