John Brown

Reputation: 43

How to use sed to fix an xml issue

I have an XML document with the following (invalid) structure:

<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1>

I want to use sed to change it into

<tag1>text1<tag2>text2<tag3>text3</tag3></tag2>text4</tag1>

i.e. I want to remove </tag1>...<tag1> (moving everything in between under the enclosing tag1) whenever I encounter the invalid XML substring <tag1></*

I have tried using sed without success (one such attempt is below):

sed -e 's/<\/tag1>\(.*\)<tag1><\//\1<\//g'

It works on the example above, but if the same condition occurs twice on a line, it just removes the first </tag1> and the last <tag1> instead of performing the replacement twice:

echo '<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1><tag1>text5<tag4>text6</tag1><tag3>text7</tag3><tag1></tag4>text8</tag1>' | sed -e 's/<\/tag1>\(.*\)<tag1><\//\1<\//g'

outputs

<tag1>text1<tag2>text2<tag3>text3</tag3><tag1></tag2>text4</tag1><tag1>text5<tag4>text6</tag1><tag3>text7</tag3></tag4>text8</tag1>

I think sed expands the RE to cover the longest possible match, but what should I do if I do not want that?

Upvotes: 4

Views: 104

Answers (4)

gymbrall

Reputation: 2063

You want non-greedy matching, but to the best of my knowledge, sed doesn't support it. Can you use perl or do you have to use sed?

Try: perl -p -e 's/<\/tag1>(.*?)<tag1>(<\/.+?<\/tag1>)/$1$2/g'

I think the issue is that the regex has to match through to the end of the actual closing tag; otherwise that closing tag becomes the beginning of the next match.

Upvotes: 1

potong

Reputation: 58558

This might work for you (GNU sed):

sed -r 's/<tag1>/\n/g;s/<\/tag1>(<tag3>[^\n]*)\n/\1/g;s/\n/<tag1>/g' file

Reduce <tag1> to a unique character, i.e. \n (sed reads input line by line, so a newline cannot otherwise appear in the pattern space), then use the negated character class [^\n] to emulate non-greedy matching. After the changes, reverse the initial substitution.
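The same placeholder trick can be sketched without hard-coding <tag3>, by anchoring on the "<tag1></" condition from the question instead. This is a variation rather than the exact command above; it assumes GNU sed (for \n in the bracket expression and in the replacement), and the variable names input and fixed are only for illustration.

```shell
# Two-occurrence input copied from the question.
input='<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1><tag1>text5<tag4>text6</tag1><tag3>text7</tag3><tag1></tag4>text8</tag1>'

# 1. Replace every <tag1> with a newline placeholder.
# 2. Drop "</tag1> ... placeholder" only when the placeholder is directly
#    followed by "</"; [^\n]* cannot cross a placeholder, so each match
#    ends at the nearest one.
# 3. Turn the remaining placeholders back into <tag1>.
fixed=$(printf '%s\n' "$input" |
  sed -E 's/<tag1>/\n/g; s/<\/tag1>([^\n]*)\n<\//\1<\//g; s/\n/<tag1>/g')

printf '%s\n' "$fixed"
```

A valid sequence such as </tag1><tag1><tag3 /> is left alone, since the placeholder there is not followed by "</".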

Upvotes: 1

glenn jackman

Reputation: 247162

GNU sed

sed '\,<tag1></,{ s,</tag1>,,; s,<tag1>,,2; }' <<END
<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1>  <!-- error case -->
<tag1><tag2 /></tag1><tag1><tag3 /></tag1>                                <!-- should not change -->
END

Output:

<tag1>text1<tag2>text2<tag3>text3</tag3></tag2>text4</tag1>  <!-- error case -->
<tag1><tag2 /></tag1><tag1><tag3 /></tag1>                                <!-- should not change -->

If the string <tag1></ is seen on a line, then remove the first </tag1> and the second <tag1> on that line.

Upvotes: 0

Cyrus

Reputation: 88899

sed 's|</tag1><tag3>|<tag3>|;s|</tag3><tag1>|</tag3>|' file.xml

Output:

<tag1>text1<tag2>text2<tag3>text3</tag3></tag2>text4</tag1>

Upvotes: 1
