AaronF

Reputation: 3081

Parsing a string with multiple regexes

Let's say I have two Java Patterns, one for finding whitespace at the beginning of the line, and the other for finding non-whitespace at the beginning of the line:

Pattern ws  = Pattern.compile("^\\s+");
Pattern nws = Pattern.compile("^\\S+");
String text = "\tSome \n\t text \n that needs \t parsing.";

I want to loop through the text, separating it into blocks of whitespace and blocks of non-whitespace, removing each token from the beginning of text:

while(text.length() > 0) {
    String nextToken = "";
    try {
        //TODO: detect grouping and move it to nextToken.
    } catch (Exception e) {
        //TODO: error handling
    }
    if(nextToken.length() > 0)
        _tokens.add(nextToken);
}

I don't just want to replace stuff. "\tSome \n\t text \n that needs \t parsing." should split to ["\t", "Some", " \n\t ", "text", ...]

How would you accomplish something like this?

Upvotes: 1

Views: 107

Answers (3)

Pshemo

Reputation: 124215

After your update, it seems that your goal is to separate whitespace from non-whitespace. In that case, the places at which you should split can be described by a regex that uses look-around. In other words, the regex should match positions which have

  • non-whitespace before and whitespace after it
  • or whitespace before and non-whitespace character after it.

Such a regex can look like "(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)", and you can pass it to the split method:

String text = "\tSome \n\t text \n that needs \t parsing.";
for (String s:text.split("(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)"))
    System.out.println("'"+s+"'");

Alternatively, you can use the alternation operator (OR, written |) together with the find method of Matcher to iterate over the text and find the matching substrings:

String text = "\tSome \n\t text \n that needs \t parsing.";

Pattern p = Pattern.compile("\\s+|\\S+");
Matcher m = p.matcher(text);
while(m.find())
    System.out.println("'"+m.group()+"'");

In both cases the output will be:

'   '
'Some'
' 
     '
'text'
' 
 '
'that'
' '
'needs'
'    '
'parsing.'

(I surrounded the results with ' to show that, for instance, the first result does in fact contain the tab character \t, which is printed as ' ')
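
If you need the tokens in a list rather than printed, both approaches can be wrapped in small helpers. This is only a minimal sketch: the class name TokenSplitDemo and the method names bySplit and byFind are made up for illustration and do not come from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; all names here are illustrative only.
class TokenSplitDemo {
    // Zero-width boundary between a whitespace and a non-whitespace character.
    private static final Pattern BOUNDARY =
            Pattern.compile("(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)");
    // A run of whitespace or a run of non-whitespace.
    private static final Pattern TOKEN = Pattern.compile("\\s+|\\S+");

    // Splits at every whitespace/non-whitespace boundary.
    static List<String> bySplit(String text) {
        return Arrays.asList(BOUNDARY.split(text));
    }

    // Collects every whitespace run or non-whitespace run, in order.
    static List<String> byFind(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "\tSome \n\t text \n that needs \t parsing.";
        // Both helpers should agree on this input.
        System.out.println(bySplit(text).equals(byFind(text)));
    }
}
```

One difference to keep in mind: for an empty input, split returns a single empty string while find produces no matches, so the two helpers only agree on non-empty text.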

Upvotes: 1

Robert Tupelo-Schneck

Reputation: 10534

You could use a Scanner and a single Pattern which matches either kind of token.

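import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Matches a run of whitespace or a run of non-whitespace.
Pattern tokenPattern = Pattern.compile("\\s+|\\S+");
String text = "\tSome \n\t text \n that needs \t parsing.";
List<String> tokens = new ArrayList<String>();
Scanner scanner = new Scanner(text);
while (true) {
    // A horizon of 0 means no bound: search the rest of the input.
    String token = scanner.findWithinHorizon(tokenPattern, 0);
    if (token == null) break; // no more tokens
    tokens.add(token);
}
System.out.println(tokens);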

Upvotes: 2

Avinash Raj

Reputation: 174696

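This removes the run of whitespace or non-whitespace characters at the start of the string. Without the MULTILINE flag, ^ matches only at the very beginning of the input, so only the first token is removed: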

System.out.println(str.replaceAll("^(?:\\s+|\\S+)", ""));
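
To keep each removed token instead of discarding it, the loop from the question can be filled in with the two original patterns. The following is a rough sketch under that assumption; the class name FrontTokenizer and the method tokenize are hypothetical, and a local list stands in for the question's _tokens field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; names are illustrative and not from the question.
class FrontTokenizer {
    // The two patterns from the question, anchored at the start of the input.
    private static final Pattern WS = Pattern.compile("^\\s+");
    private static final Pattern NWS = Pattern.compile("^\\S+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        while (!text.isEmpty()) {
            // Try the whitespace pattern first, then the non-whitespace one.
            Matcher m = WS.matcher(text);
            if (!m.find()) {
                m = NWS.matcher(text);
                if (!m.find()) {
                    // Defensive only: every character is either \s or \S.
                    throw new IllegalStateException("No token at start of: " + text);
                }
            }
            tokens.add(m.group());
            // Drop the matched token from the front of the remaining text.
            text = text.substring(m.end());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("\tSome \n\t text \n that needs \t parsing.").size());
    }
}
```

This repeatedly copies the remaining text with substring, so for long inputs a single Matcher over the whole string is likely the cheaper choice.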

Upvotes: 1
