LandonSchropp

Reputation: 10244

Pattern Parsing Java

Pretend my goal in a program is to parse as many occurrences of "ab" out of a string as I can. I approach this problem with the following code:

public static void main(String[] args)
{
    final String expression = "^(\\s*ab)";

    Scanner scanner = new Scanner("ab abab  ab");

    while (scanner.hasNext())
    {
        String next = scanner.findWithinHorizon(expression, 0);

        if (next == null)
        {
            System.out.println("FAIL");
            break;
        }
        else
        {
            System.out.println(next);
        }
    }
}

The caret at the beginning of the expression is there to disallow anything but whitespace at the beginning of each read, as mentioned here. It prevents something like "cab" or "c ab" from being allowed. In fact, I would expect null to be returned and FAIL to be printed to the console if either of these cases occurs. If I remove the caret from the expression, it works perfectly fine on input such as "ab abab ab", but fails to return null for "c ab". On the other hand, if I leave the caret in, then "c ab" returns null as expected but "ab abab ab" fails. How can I make this work?

Edit

My original post may have been a little vague. The example I gave above is a simpler version of my real problem. The pattern ab is a filler pattern that I would replace with something more interesting, say an email address regex or a hexadecimal value.

In my application, the input to the scanner is not a string, but an input stream of which I have no knowledge. My goal in the loop is to read in values one at a time from the input and verify their contents match some pattern. If they do, then I could do something more interesting with them. If not, then the program terminates.

In the above example, I would expect an input of ab abab  ab (with two spaces before the final ab, as in the code) to output:

ab
 ab
ab
  ab

I would expect c ab to output:

FAIL

and I would expect ab cab to output:

ab
FAIL

Upvotes: 2

Views: 820

Answers (4)

Thomas

Reputation: 88707

In the other thread you wanted to match the first occurrence of ab, so the caret was fine. If you want to match every occurrence of ab until another character occurs, try this expression: String expression = "\\G(\\s*ab)";

The \G means that the next match should start at the position the previous stopped at.

If I use that with your code I get the following results:

  1. Input = "ab abab ab" , Output = "ab", " ab", "ab", " ab"

  2. Input = "cab abab ab", Output = "FAIL"

  3. Input = "ab c abab ab", Output = "ab", "FAIL"

  4. Input = "ab abab abc", Output = "ab", " ab", "ab", " ab", "FAIL"
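
For reference, a self-contained sketch of that loop with the \G expression might look like the following. The class name GAnchorDemo and the helper parse are made up for illustration, the results are collected in a list instead of being printed directly, and the input is assumed to be a plain string rather than a stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class GAnchorDemo {
    // \G anchors each search at the point where the previous read stopped,
    // so only whitespace may precede the next "ab".
    static final String EXPRESSION = "\\G(\\s*ab)";

    // Hypothetical helper: returns every match in order and appends "FAIL"
    // as soon as something other than whitespace plus "ab" is encountered.
    static List<String> parse(String input) {
        List<String> results = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            while (scanner.hasNext()) {
                String next = scanner.findWithinHorizon(EXPRESSION, 0);
                if (next == null) {
                    results.add("FAIL");
                    break;
                }
                results.add(next);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(parse("ab abab ab"));
        System.out.println(parse("ab c abab ab"));
    }
}
```

Because Scanner accepts an InputStream as well as a String, the same loop should carry over to the stream case by changing only the constructor argument.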

Upvotes: 4

anubhava

Reputation: 785176

Please understand that the findWithinHorizon method in Scanner is for finding the next occurrence of a pattern constructed from the specified string, and NOT for matching the whole input. If you write a regex that matches the whole input, it will just return the input text as is (as per VMykyt's answer here). But that is not what you want, as I understand it.

So you need to make a separate call to the String#matches method to make sure there is nothing but whitespace in front of your text, and if it matches, then just find all occurrences of ab.

Consider this minor change in your code:

public static void main(String[] args) {
   matchIt("ab abab  ab");
   matchIt("c ab");
   matchIt("cab");
}

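private static void matchIt(String str) {
   final String expression = "ab";
   System.out.println("Input: [" + str + ']');
   Scanner scanner = new Scanner(str);

   // Reject the input up front unless it starts with optional whitespace followed by "ab".
   if (str.matches("^\\s*ab.*$")) {
      while (scanner.hasNext()) {
         String next = scanner.findWithinHorizon(expression, 0);
         if (next == null) {
            System.out.println("FAIL");
            break;
         }
         else {
            System.out.println(next);
         }
      }
   }
   else {
      System.out.println("FAIL");
   }
   // Separator between inputs, matching the output shown below.
   System.out.println("===========================");
}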

OUTPUT:

Input: [ab abab  ab]
ab
ab
ab
ab
===========================
Input: [c ab]
FAIL
===========================
Input: [cab]
FAIL
===========================

Upvotes: 0

David

Reputation: 180

If I've understood your question correctly, the fault is in the expression. If you always want whitespace at the beginning, you should use ^(\s+) and not ^(\s*), since * allows zero occurrences while + means at least one.
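
The difference between the two quantifiers can be checked with a small sketch like the one below. The class name QuantifierDemo and its two helpers are invented for illustration, and String#matches is used here only to test a single token in isolation, not as the Scanner loop from the question.

```java
public class QuantifierDemo {
    // Hypothetical helper: optional leading whitespace (* allows zero occurrences).
    static boolean matchesOptionalSpace(String token) {
        return token.matches("^(\\s*)ab");
    }

    // Hypothetical helper: mandatory leading whitespace (+ needs at least one).
    static boolean matchesRequiredSpace(String token) {
        return token.matches("^(\\s+)ab");
    }

    public static void main(String[] args) {
        for (String token : new String[] {"ab", " ab", "  ab"}) {
            System.out.println("[" + token + "] optional=" + matchesOptionalSpace(token)
                    + " required=" + matchesRequiredSpace(token));
        }
    }
}
```

Note that with \s+ a token with nothing in front of it, such as the very first ab in "ab abab ab", no longer matches.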

Upvotes: 0

VMykyt

Reputation: 1629

Well... I think you can do this with a single regex call.

Try the following pattern:

expression = "^(\\s*ab*)*$";
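
A rough way to see how this pattern behaves is to run it through String#matches on the whole input, as sketched below. The class and method names are hypothetical. One thing worth noticing is that ab* means "a followed by any number of b", so inputs like "abbb" also pass. The stricter variant ^(\s*ab)*$ shown alongside it is only an assumption about what was intended.

```java
public class WholeInputDemo {
    // The pattern as posted: each group is whitespace, an 'a', then zero or more 'b'.
    static boolean matchesLoose(String input) {
        return input.matches("^(\\s*ab*)*$");
    }

    // Assumed stricter variant: each group is whitespace followed by exactly "ab".
    static boolean matchesStrict(String input) {
        return input.matches("^(\\s*ab)*$");
    }

    public static void main(String[] args) {
        for (String input : new String[] {"ab abab  ab", "c ab", "cab", "abbb"}) {
            System.out.println("[" + input + "] loose=" + matchesLoose(input)
                    + " strict=" + matchesStrict(input));
        }
    }
}
```

Either way, this validates the input as a whole rather than one value at a time, so it needs the complete string in memory rather than a stream.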

Upvotes: 0
