Another Compiler Error

Reputation: 373

Pattern Matcher Vs String Split, which should I use?

First time posting.

First, I know how to use both Pattern/Matcher and String.split. My question is which is best for my example, and why? Suggestions for better alternatives are also welcome.

Task: I need to extract an unknown NOUN that sits between two known regexps in an unknown string.

My solution: find the end of the regexp1 match and the start of the regexp2 match, then use substring to extract the noun.

    String line = "unknownXoooXNOUNXccccccXunknown";
    int goal = 12;
    String regexp1 = "Xo+X";
    String regexp2 = "Xc+X";

  1. I need to locate the index position AFTER the first regex.
  2. I need to locate the index position BEFORE the second regex.

A) I can use pattern matcher

    Pattern p = Pattern.compile(regexp1);
    Matcher m = p.matcher(line);
    if (m.find()) {
        int afterRegex1 = m.end();
    } else {
        throw new IllegalArgumentException();
        //TODO Exception Management;
    }

B) I can use String Split

    String[] split = line.split(regexp1, 2);
    if (split.length != 2) {
        throw new UnsupportedOperationException();
        //TODO Exception Management;
    }
    int afterRegex1 = line.indexOf(split[1]);

Which approach should I use, and why? I don't know which is more efficient in time and memory. Both are about equally readable to me.

Upvotes: 8

Views: 10634

Answers (4)

Ian McLaird

Reputation: 5585

I'd do it like this:

    String line = "unknownXoooXNOUNXccccccXunknown";
    String regex = "Xo+X(.*?)Xc+X";

    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(line);
    if (m.find()) {
        String noun = m.group(1);
    }

The (.*?) is used to make the inner match on the NOUN reluctant. This protects us from a case where our ending pattern appears again in the unknown portion of the string.
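To make the difference concrete, here is a minimal sketch comparing the reluctant and greedy forms. The input string with a repeated end delimiter and the helper name `extract` are made up for illustration; they are not part of the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String extract(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input: the end delimiter (Xc+X) occurs twice.
        String line = "unknownXoooXNOUNXccXtailXcccXunknown";

        // Reluctant: stops at the first end delimiter.
        System.out.println(extract("Xo+X(.*?)Xc+X", line)); // prints NOUN

        // Greedy: runs on to the last end delimiter.
        System.out.println(extract("Xo+X(.*)Xc+X", line));  // prints NOUNXccXtail
    }
}
```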

EDIT

This works because the (.*?) defines a capture group. There's only one such group defined in the pattern, so it gets index 1 (the parameter to m.group(1)). These groups are indexed from left to right starting at 1. If the pattern were defined like this

    String regex = "(Xo+X)(.*?)(Xc+X)";

Then there would be three capture groups, such that

    m.group(1); // yields "XoooX"
    m.group(2); // yields "NOUN"
    m.group(3); // yields "XccccccX"

There is a group 0, but that matches the whole pattern, and it's equivalent to this

    m.group(); // yields "XoooXNOUNXccccccX"

For more information about what you can do with the Matcher, including ways to get the start and end positions of your pattern within the source string, see the Matcher JavaDocs.
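As a rough sketch of the position lookups mentioned above: `Matcher.start(int)` and `Matcher.end(int)` give the bounds of a capture group, which covers the two index positions the question asks for. The class and method names here are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupPositions {
    // Returns {start, end} of capture group 1, or null if the pattern does not match.
    static int[] nounBounds(String line) {
        Matcher m = Pattern.compile("Xo+X(.*?)Xc+X").matcher(line);
        if (!m.find()) {
            return null;
        }
        // start(1) is the index just after the first delimiter,
        // end(1) is the index just before the second delimiter.
        return new int[] { m.start(1), m.end(1) };
    }

    public static void main(String[] args) {
        String line = "unknownXoooXNOUNXccccccXunknown";
        int[] bounds = nounBounds(line);
        System.out.println(bounds[0] + " to " + bounds[1]);       // prints 12 to 16
        System.out.println(line.substring(bounds[0], bounds[1])); // prints NOUN
    }
}
```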

Upvotes: 6

Holger

Reputation: 298579

If you really need the locations you can do it like this:

    String line = "unknownXoooXNOUNXccccccXunknown";
    String regexp1 = "Xo+X";
    String regexp2 = "Xc+X";

    Matcher m = Pattern.compile(regexp1).matcher(line);
    if (m.find()) {
        int start = m.end();
        if (m.usePattern(Pattern.compile(regexp2)).find()) {
            final int end = m.start();
            System.out.println("from " + start + " to " + end + " is " + line.substring(start, end));
        }
    }

But if you just need the word in between, I recommend the way Ian McLaird has shown.

Upvotes: 0

maaartinus

Reputation: 46492

It looks like you want to extract a unique occurrence. For that, simply do:

    input.replaceAll(".*Xo+X(.*)Xc+X.*", "$1")

For efficiency, compile the Pattern once and call matcher(input).replaceAll("$1") on it instead.

In case your input contains line breaks, use Pattern.DOTALL or the (?s) modifier.
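A minimal sketch of that precompiled variant, assuming the delimiters occur once in the input. The class name, constant name, and sample inputs are made up for illustration. Note that replaceAll returns the input unchanged when the pattern does not match, so the caller has to handle that case.

```java
import java.util.regex.Pattern;

class ReplaceAllDemo {
    // Compiled once and reused; DOTALL lets '.' match line breaks as well.
    private static final Pattern NOUN =
            Pattern.compile(".*Xo+X(.*)Xc+X.*", Pattern.DOTALL);

    // Replaces the whole input with capture group 1.
    static String extract(String input) {
        return NOUN.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(extract("unknownXoooXNOUNXccccccXunknown")); // prints NOUN
        System.out.println(extract("first\nXoooXNOUNXccccccX\nlast"));  // prints NOUN
        System.out.println(extract("no delimiters here"));              // prints the input unchanged
    }
}
```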


In case you want to use split, consider using Guava's Splitter. It behaves more sanely and also accepts a Pattern, which is good for speed.

Upvotes: 2

willkil

Reputation: 1669

You should use String.split() for readability unless you're in a tight loop.

Per its javadoc, String.split(regex) behaves like Pattern.compile(regex).split(str), so in a tight loop you can compile the Pattern once and reuse it.
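Here is one way the hoisted compile could look, as a sketch rather than a definitive version. The class and method names are hypothetical. It also derives the index from the length of the remainder instead of calling indexOf, which is a deliberate change from the question's snippet: indexOf could return an earlier position if the remainder text happened to occur sooner in the line.

```java
import java.util.regex.Pattern;

class SplitDemo {
    // Compiled once, outside any loop, instead of on every String.split call.
    private static final Pattern START = Pattern.compile("Xo+X");

    // Returns the index just after the first match of the start delimiter.
    static int afterStart(String line) {
        String[] parts = START.split(line, 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("start delimiter not found");
        }
        // Everything after the delimiter is parts[1], so its length gives the offset.
        return line.length() - parts[1].length();
    }

    public static void main(String[] args) {
        System.out.println(afterStart("unknownXoooXNOUNXccccccXunknown")); // prints 12
    }
}
```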

Upvotes: 4
