Reputation: 3211
I have English sentences in which some words are wrapped in XML tags, for example:
<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.
Exactly three XML tags can occur, as the sentence shows (<XXX>, <YYY>, <ZZZ>). Any number of words can appear inside one of those tags.
I need to split the sentences at whitespace, ignoring whitespace inside those XML tags. The code looks like this:
String mySentence = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
String[] mySentenceSplit = mySentence.split("someUnknownRegex");
for (int i = 0; i < mySentenceSplit.length; i++) {
    System.out.println(mySentenceSplit[i]);
}
For the example above, the result should be:
mySentenceSplit[0] = <XXX>word1</XXX>
mySentenceSplit[1] = word2
mySentenceSplit[2] = word3
mySentenceSplit[3] = <YYY>word4 word5 word6</YYY>
mySentenceSplit[4] = word7
mySentenceSplit[5] = word8
mySentenceSplit[6] = word9
mySentenceSplit[7] = word10
mySentenceSplit[8] = <ZZZ>word11 word12</ZZZ>.
What do I have to put in place of "someUnknownRegex" to achieve this?
Upvotes: 0
Views: 171
Reputation: 41838
kiltek, resurrecting this question because it had a simple regex solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
With all the usual disclaimers about using regex to parse XML, here is a simple regex to do it:
<.*?</[^>]*>|( )
The left side of the alternation matches a complete tagged span, from the opening tag through the closing tag. We ignore these matches. The right side matches a space and captures it in Group 1, and we know these are the right spaces because the expression on the left did not consume them.
Here is working code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Program {
    public static void main(String[] args) {
        String subject = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>";
        // Left alternative: a complete tagged span. Right alternative: a space, captured in Group 1.
        Pattern regex = Pattern.compile("<.*?</[^>]*>|( )");
        Matcher m = regex.matcher(subject);
        StringBuffer b = new StringBuffer();
        while (m.find()) {
            if (m.group(1) != null) {
                // A space outside any tag: mark it as a split point.
                m.appendReplacement(b, "SplitHere");
            } else {
                // A tagged span: copy it back unchanged. quoteReplacement keeps '$' and '\' literal.
                m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
            }
        }
        m.appendTail(b);
        String replaced = b.toString();
        String[] splits = replaced.split("SplitHere");
        for (String split : splits) System.out.println(split);
    } // end main
} // end Program
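The placeholder step is not strictly required: the same alternation can drive the split directly by slicing the input at every Group 1 match. The following is a minimal sketch of that variation, not the code from the demo. The class name `TagAwareSplitter` and the method `splitOutsideTags` are hypothetical, and it assumes tags are not nested and that words are separated by single spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagAwareSplitter {
    // Left alternative consumes a whole tagged span; right alternative captures a space.
    private static final Pattern TOKEN = Pattern.compile("<.*?</[^>]*>|( )");

    // Hypothetical helper: returns the pieces between spaces that lie outside tags.
    static List<String> splitOutsideTags(String subject) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(subject);
        int start = 0; // index where the current piece begins
        while (m.find()) {
            if (m.group(1) != null) {
                // A space outside any tag: close the current piece here.
                parts.add(subject.substring(start, m.start()));
                start = m.end();
            }
            // Otherwise a tagged span was matched; it stays part of the current piece.
        }
        parts.add(subject.substring(start)); // whatever follows the last split point
        return parts;
    }

    public static void main(String[] args) {
        String subject = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>";
        for (String part : splitOutsideTags(subject)) {
            System.out.println(part);
        }
    }
}
```

This avoids choosing a marker string such as "SplitHere" that might also occur in the input. Two consecutive spaces outside a tag would yield an empty entry, so the captured alternative could be changed to `( +)` if that matters.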
Upvotes: 0
Reputation: 425063
Here's the split regex you want:
String[] split = str.split(" +(?=[^<]*(<[^/]|$))");
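The lookahead accepts a run of spaces only when the next `<` after it starts an opening tag, or when no `<` follows at all. A space inside a tag pair is always followed by a closing `</...>` first, so it is skipped. Below is a minimal runnable sketch of this approach, written with the lookahead group closed. The class name `LookaheadSplit` and the method `splitOutsideTags` are hypothetical, and the sketch assumes tags are not nested and that plain text contains no `<`.

```java
import java.util.Arrays;

class LookaheadSplit {
    // Hypothetical helper wrapping the lookahead-based split.
    static String[] splitOutsideTags(String str) {
        // " +"        one or more spaces ...
        // (?=[^<]*    ... followed by text that contains no '<' ...
        // (<[^/]|$))  ... and then either an opening tag or the end of the input
        return str.split(" +(?=[^<]*(<[^/]|$))");
    }

    public static void main(String[] args) {
        String str = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
        System.out.println(Arrays.toString(splitOutsideTags(str)));
    }
}
```

Because the pattern is `" +"`, several consecutive spaces outside a tag are treated as a single separator.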
Upvotes: 1
Reputation: 369134
Using a capturing group and a backreference, so that each closing tag must match its opening tag:
String sentence = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
Pattern pattern = Pattern.compile("<(\\w+)[^>]*>.*?</\\1>\\.?|\\S+");
Matcher matcher = pattern.matcher(sentence);
while (matcher.find()) {
    System.out.println(matcher.group());
}
Output:
<XXX>word1</XXX>
word2
word3
<YYY>word4 word5 word6</YYY>
word7
word8
word9
word10
<ZZZ>word11 word12</ZZZ>.
Upvotes: 2