Reputation: 3211

java regex split at whitespace except whitespace inside xml

I have English sentences in which some words are wrapped in XML tags, for example:

<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.

There are exactly three possible XML tags, as the sentence shows (<XXX>, <YYY>, <ZZZ>). Any number of words can appear inside one of those tags.

I need to split the sentence at whitespace while ignoring whitespace inside those XML tags. The code looks like this:

String mySentence = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
String[] mySentenceSplit = mySentence.split("someUnknownRegex");
for (int i = 0; i < mySentenceSplit.length; i++) {
    System.out.println(mySentenceSplit[i]);
}

For the example above, the output should be:

mySentenceSplit[0] = <XXX>word1</XXX>
mySentenceSplit[1] = word2
mySentenceSplit[2] = word3
mySentenceSplit[3] = <YYY>word4 word5 word6</YYY>
mySentenceSplit[4] = word7
mySentenceSplit[5] = word8
mySentenceSplit[6] = word9
mySentenceSplit[7] = word10
mySentenceSplit[8] = <ZZZ>word11 word12</ZZZ>.

What do I have to put in place of "someUnknownRegex" to achieve this?

Upvotes: 0

Answers (3)

zx81

Reputation: 41838

kiltek, resurrecting this question because it had a simple regex solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)

With all the usual disclaimers about using regex to parse XML, here is a simple regex to do it:

<.*?</[^>]*>|( )

The left side of the alternation matches complete XML elements, from the opening tag through the closing tag. We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expression on the left.

Here is working code (see online demo):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Program {
    public static void main(String[] args) {
        String subject = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>";
        // Left alternative consumes whole tagged spans; Group 1 captures only spaces outside them.
        Pattern regex = Pattern.compile("<.*?</[^>]*>|( )");
        Matcher m = regex.matcher(subject);
        StringBuffer b = new StringBuffer();
        while (m.find()) {
            if (m.group(1) != null) {
                // A space outside any tag: mark it as a split point.
                m.appendReplacement(b, "SplitHere");
            } else {
                // A tagged span: copy it through unchanged.
                // quoteReplacement keeps any '$' or '\' in the text literal.
                m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
            }
        }
        m.appendTail(b);
        String replaced = b.toString();
        String[] splits = replaced.split("SplitHere");
        for (String split : splits) System.out.println(split);
    } // end main
} // end Program
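
The placeholder step is not strictly needed. Here is a rough sketch, assuming the same regex, that slices the input directly at the positions where Group 1 matched. This avoids misfiring if the text ever contains the literal string "SplitHere". The class and method names (GroupOneSplitDemo, splitOutsideTags) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupOneSplitDemo {
    // Same pattern as above: tagged spans on the left, captured space on the right.
    private static final Pattern TAG_OR_SPACE = Pattern.compile("<.*?</[^>]*>|( )");

    // Hypothetical helper: cuts the input at every space that landed in Group 1.
    static List<String> splitOutsideTags(String subject) {
        List<String> parts = new ArrayList<>();
        Matcher m = TAG_OR_SPACE.matcher(subject);
        int start = 0;
        while (m.find()) {
            if (m.group(1) != null) {
                // Space outside a tag: emit the text since the previous cut.
                parts.add(subject.substring(start, m.start(1)));
                start = m.end(1);
            }
            // Otherwise a tagged span matched; skip over it without cutting.
        }
        // Whatever follows the last cut is the final token.
        parts.add(subject.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        String subject = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
        for (String part : splitOutsideTags(subject)) {
            System.out.println(part);
        }
    }
}
```

Note that two consecutive spaces outside a tag would produce an empty token here, since each space is a separate cut.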


Upvotes: 0

Bohemian

Reputation: 425063

Here's the split regex you want:

String[] split = str.split(" +(?=[^<]*(<[^/]|$))");
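
A runnable sketch of this lookahead split, with the lookahead's parentheses balanced. It splits on a run of spaces only when the next '<' ahead starts an opening tag, or when no '<' remains before the end of the input. It assumes tags are not nested and that '<' never appears in plain text. The class and method names are hypothetical.

```java
class LookaheadSplitDemo {
    // Hypothetical helper wrapping the split call.
    // " +"            one or more spaces (the delimiter)
    // "(?=[^<]*"      looking ahead, skip everything up to the next '<' ...
    // "(<[^/]|$))"    ... which must begin an opening tag, or we must be at the end.
    static String[] splitOutsideTags(String str) {
        return str.split(" +(?=[^<]*(<[^/]|$))");
    }

    public static void main(String[] args) {
        String str = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
        for (String part : splitOutsideTags(str)) {
            System.out.println(part);
        }
    }
}
```

A space inside a tagged span fails the lookahead because the next '<' it reaches belongs to a closing tag ("</"), so those spaces are left alone.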

Upvotes: 1

falsetru

Reputation: 369134

Using a capturing group and a backreference:

String sentence = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
Pattern pattern = Pattern.compile("<(\\w+)[^>]*>.*?</\\1>\\.?|\\S+");
Matcher matcher = pattern.matcher(sentence);

while (matcher.find()) {
    System.out.println(matcher.group());
}

Output:

<XXX>word1</XXX>
word2
word3
<YYY>word4 word5 word6</YYY>
word7
word8
word9
word10
<ZZZ>word11 word12</ZZZ>.

Upvotes: 2
