user1110005

Reputation: 19

Java Regex xml parsing

I'm trying to find an element in XML, from its opening tag to its closing tag, and replace it with an empty string. A sample XML document looks like this:

<lins>
  <lin index="1"> ...<feature>Something</feature>... </lin>
  <lin index="2">...<feature>Something</feature>... </lin>
  <lin index="3">...<feature>Something</feature>....</lin>

  <lin index="1">...<feature>Icom</feature>... </lin>
  <lin index="2">...<feature>Icom</feature>... </lin>
</lins>

I need to remove everything from <lin> to </lin> whenever Icom appears in between.

The pattern <lin\s(.+?Icom.+?)+</lin> removes all lin items, because it matches from the first opening <lin> tag to the last closing </lin> tag. I would greatly appreciate a suggestion for how to do this. Note that I cannot use XML parsers in my situation.

Upvotes: 1

Views: 1159

Answers (3)

ozoli

Reputation: 1444

I think you need to add more groups to the regexp.

Add a group for the precondition to start checking for ex (

Then add a group for the content in between, a group for Icom, and so on.

So off the top of my head my RegEx would look like:

(<lin\ index\=)(\w+Icom\w+)(\<\/lin>)

Note that the escaping might be slightly off, but in essence you need more groups and less greedy quantifiers.

Upvotes: 0

shift66

Reputation: 11958

You can't do this with a regexp.
Take this example:

<tag>
    <tag> something </tag>
</tag>

<tag>
</tag>

If you use the regexp "<tag>(.*)</tag>" (with . allowed to match newlines), your group will be this:

    <tag> something </tag>
</tag>

<tag>

And if you use the regexp "<tag>(.*?)</tag>" (again with . matching newlines), your group will be this:

    <tag> something
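
To make the two cases above easy to reproduce, here is a small sketch. The class name GreedyVsLazy and the helper firstGroup are made up for illustration, and the (?s) flag is an assumption added so that . can match the line breaks in the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Hypothetical helper: returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The nested example from this answer.
        String text = "<tag>\n    <tag> something </tag>\n</tag>\n\n<tag>\n</tag>";

        // Greedy: runs from the first <tag> to the LAST </tag> in the input.
        System.out.println("[" + firstGroup("(?s)<tag>(.*)</tag>", text) + "]");

        // Lazy: stops at the FIRST </tag>, which belongs to the inner element.
        System.out.println("[" + firstGroup("(?s)<tag>(.*?)</tag>", text) + "]");
    }
}
```

Neither quantifier pairs the outer opening tag with its own closing tag, which is the point of the example.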

You should use something like a stack to find the closing tag that matches each opening tag.
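
A rough sketch of that idea, not a full solution: when only one tag name is tracked, a depth counter can stand in for the stack. The class TagMatcher and the method findMatchingEnd are hypothetical names, and the sketch assumes there are no self-closing tags and no tags hidden inside comments, CDATA or attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagMatcher {
    // Returns the index just past the closing tag that balances the first
    // opening tag found at or after 'start', or -1 if there is none.
    // Assumption: no self-closing <tag/> and no tags inside comments or CDATA.
    static int findMatchingEnd(String text, String tag, int start) {
        Pattern token = Pattern.compile("<(/?)" + Pattern.quote(tag) + "\\b[^>]*>");
        Matcher m = token.matcher(text);
        int depth = 0;                       // plays the role of the stack
        boolean found = m.find(start);
        while (found) {
            if (m.group(1).isEmpty()) {
                depth++;                     // opening tag: "push"
            } else {
                depth--;                     // closing tag: "pop"
            }
            if (depth == 0) {
                return m.end();              // balanced: this closes the first opening tag
            }
            if (depth < 0) {
                return -1;                   // a closing tag appeared before any opening tag
            }
            found = m.find();
        }
        return -1;                           // ran out of input while still unbalanced
    }

    public static void main(String[] args) {
        String text = "<tag>\n    <tag> something </tag>\n</tag>\n\n<tag>\n</tag>";
        int end = findMatchingEnd(text, "tag", 0);
        System.out.println(text.substring(0, end));
    }
}
```

For the example above, this pairs the outer opening tag with the second closing tag rather than the first or the last one.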

Upvotes: 0

Tim Pietzcker

Reputation: 336448

String result = subject.replaceAll("(?s)<lin\\b(?:(?!</lin).)*Icom(?:(?!</lin).)*</lin>", "");

should do this, unless you have <lin> tags nested into each other (or inside comments/strings).

Explanation:

<lin\b              # Match <lin (but not link or linen)
(?:                 # Match...
 (?!</lin)          # as long as we're not at a closing tag
 .                  # any character
)*                  # any number of times.
Icom                # Match Icom
(?:(?!</lin).)*     # (as above:) Match any character except closing tag
</lin>              # Match closing tag
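
For anyone who wants to try it, here is a minimal runnable sketch around the same pattern. The class name RemoveIcomLines, the method removeIcomLines and the shortened sample input are only illustrative; the sketch carries the same assumption as above that <lin> elements are not nested.

```java
import java.util.regex.Pattern;

class RemoveIcomLines {
    // Same pattern as in the answer; (?s) lets '.' match line breaks too.
    private static final Pattern ICOM_LIN = Pattern.compile(
            "(?s)<lin\\b(?:(?!</lin).)*Icom(?:(?!</lin).)*</lin>");

    // Removes every <lin ...>...</lin> element that contains "Icom".
    // Assumption: <lin> elements are not nested inside each other.
    static String removeIcomLines(String xml) {
        return ICOM_LIN.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String xml =
                "<lins>\n"
              + "  <lin index=\"1\">...<feature>Something</feature>...</lin>\n"
              + "  <lin index=\"2\">...<feature>Icom</feature>...</lin>\n"
              + "</lins>";
        System.out.println(removeIcomLines(xml));
    }
}
```

Only the matched element is removed, so any indentation or line break that surrounded it stays in the output.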

Upvotes: 4
