Reputation: 3081
Let's say I have two Java Patterns, one for finding whitespace at the beginning of the line, and the other for finding non-whitespace at the beginning of the line:
Pattern ws = Pattern.compile("^\\s+");
Pattern nws = Pattern.compile("^\\S+");
String text = "\tSome \n\t text \n that needs \t parsing.";
I want to loop through the text, separating it into blocks of whitespace and blocks of non-whitespace, removing each token from the beginning of text:
while(text.length() > 0) {
    String nextToken = "";
    try {
        //TODO: detect grouping and move it to nextToken.
    } catch (Exception e) {
        //TODO: error handling
    }
    if(nextToken.length() > 0)
        _tokens.add(nextToken);
}
I don't just want to replace stuff. "\tSome \n\t text \n that needs \t parsing." should split to ["\t", "Some", "\n\t ", "text", ...]
How would you accomplish something like this?
Upvotes: 1
Views: 107
Reputation: 124215
After your update, it seems that your goal is to separate whitespace from non-whitespace. In that case, the places where you should split can be described by a regex that uses look-around mechanisms. In other words, the regex should match the positions that have
whitespace before them and non-whitespace after them, or
non-whitespace before them and whitespace after them.
Such a regex can look like "(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)" and you can use it with the split method:
String text = "\tSome \n\t text \n that needs \t parsing.";
for (String s : text.split("(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)"))
    System.out.println("'" + s + "'");
On the other hand, you may also want to use the alternation operator (OR), which is written as |, together with the find method of Matcher to iterate over the text and find the matching substrings:
String text = "\tSome \n\t text \n that needs \t parsing.";
Pattern p = Pattern.compile("\\s+|\\S+");
Matcher m = p.matcher(text);
while (m.find())
    System.out.println("'" + m.group() + "'");
In both cases the output will be
' '
'Some'
'
'
'text'
'
'
'that'
' '
'needs'
' '
'parsing.'
(I surrounded the results with ' to show that, for instance, the first result does in fact contain the tab character \t, which is printed as ' '.)
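Because raw tabs and newlines are hard to read in printed output, here is a minimal sketch that collects the tokens with the same "\\s+|\\S+" pattern and prints them with the whitespace escaped. The class name VisibleTokens and the escape helper are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VisibleTokens {
    // Same alternation as above: a run of whitespace or a run of non-whitespace.
    private static final Pattern TOKEN = Pattern.compile("\\s+|\\S+");

    // Collects every match of TOKEN, in order of appearance, into a list.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Hypothetical helper: shows tab, newline and carriage return as escape sequences.
    static String escape(String s) {
        return s.replace("\t", "\\t").replace("\n", "\\n").replace("\r", "\\r");
    }

    public static void main(String[] args) {
        String text = "\tSome \n\t text \n that needs \t parsing.";
        for (String token : tokenize(text)) {
            System.out.println("'" + escape(token) + "'");
        }
    }
}
```

With this, the first token is printed as '\t' rather than as a quoted blank, so it is easier to check which whitespace characters each token holds.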
Upvotes: 1
Reputation: 10534
You could use a Scanner and a single Pattern which matches either kind of token.
import java.util.*;
import java.util.regex.Pattern;

Pattern tokenPattern = Pattern.compile("\\s+|\\S+");
String text = "\tSome \n\t text \n that needs \t parsing.";
List<String> tokens = new ArrayList<String>();
Scanner scanner = new Scanner(text);
while (true) {
    // A horizon of 0 means the search is not bounded; null means no more matches.
    String token = scanner.findWithinHorizon(tokenPattern, 0);
    if (token == null) break;
    tokens.add(token);
}
System.out.println(tokens);
Upvotes: 2
Reputation: 174696
This removes the run of whitespace or non-whitespace characters at the start of the string str:
System.out.println(str.replaceAll("^(?:\\s+|\\S+)", ""));
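Since replaceAll only discards the leading token, here is a rough sketch of how the loop from the question could keep each token before removing it. It assumes the two patterns from the question and uses Matcher.lookingAt, which only matches at the start of the input. The class name FrontTokenizer is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FrontTokenizer {
    // The two patterns from the question, unchanged.
    private static final Pattern WS = Pattern.compile("^\\s+");
    private static final Pattern NWS = Pattern.compile("^\\S+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        while (!text.isEmpty()) {
            // Try whitespace first, then fall back to non-whitespace.
            Matcher m = WS.matcher(text);
            if (!m.lookingAt()) {
                m = NWS.matcher(text);
                if (!m.lookingAt()) {
                    // Defensive only: every character is either \s or \S.
                    throw new IllegalStateException("No token at start of input");
                }
            }
            tokens.add(m.group());
            // Drop the matched token from the front of the remaining text.
            text = text.substring(m.end());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("\tSome \n\t text \n that needs \t parsing."));
    }
}
```

Each pass copies the remaining text with substring, so for long inputs a single Matcher with find() over the whole string is likely the cheaper choice.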
Upvotes: 1