Reputation: 43
I have an XML document with the following (invalid) structure:
<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1>
I want to use sed to change it into
<tag1>text1<tag2>text2<tag3>text3</tag3></tag2>text4</tag1>
i.e. I want to remove </tag1>...<tag1> (and move everything in between under the enclosing tag1) whenever I encounter an invalid XML substring such as <tag1></*
I have tried using sed without success (one such attempt is below)
sed -e 's/<\/tag1>\(.*\)<tag1><\//\1<\//g'
It works with the example above, but if the same condition occurs twice, it removes only the first </tag1> and the last <tag1> instead of performing the replacement twice:
echo '<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1><tag1>text5<tag4>text6</tag1><tag3>text7</tag3><tag1></tag4>text8</tag1>' | sed -e 's/<\/tag1>\(.*\)<tag1><\//\1<\//g'
outputs
<tag1>text1<tag2>text2<tag3>text3</tag3><tag1></tag2>text4</tag1><tag1>text5<tag4>text6</tag1><tag3>text7</tag3></tag4>text8</tag1>
I think sed just expands the RE to cover the longest possible match. What should I do if I do not want it to do that?
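The greedy behaviour suspected above can be reproduced on a much smaller input. The following is only a minimal sketch with a made-up sample string (not the real data): `.*` runs to the last possible closing tag, while a negated character class stops at the first one.

```shell
# Made-up sample line, used only to show greedy matching.
line='<b>one</b><b>two</b>'

# Greedy: .* extends to the LAST </b>, so both elements are swallowed.
greedy=$(printf '%s\n' "$line" | sed 's/<b>.*<\/b>/X/')

# Bounded: [^<]* cannot cross a "<", so the match ends at the FIRST </b>.
bounded=$(printf '%s\n' "$line" | sed 's/<b>[^<]*<\/b>/X/')

printf '%s\n' "$greedy" "$bounded"
```

The negated class only helps when a single character can act as the boundary, which is not the case for a multi-character tag such as <tag1>.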
Upvotes: 4
Views: 104
Reputation: 2063
You want non-greedy matching, but to the best of my knowledge, sed doesn't support it. Can you use perl or do you have to use sed?
Try: perl -p -e 's/<\/tag1>(.*?)<tag1>(<\/.+?<\/tag1>)/$1$2/g'
I think the issue is that the regex has to match through to the end of the actual closing </tag1>; otherwise that closing tag becomes the beginning of the next match.
Upvotes: 1
Reputation: 58558
This might work for you (GNU sed):
sed -r 's/<tag1>/\n/g;s/<\/tag1>(<tag3>[^\n]*)\n/\1/g;s/\n/<tag1>/g' file
Reduce <tag1> to a unique character, i.e. \n, then use the negated character class [^\n] to obtain non-greedy matching. After the changes, reverse the initial substitution.
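The three steps can be tried on the two-occurrence input from the question. This is only a sketch: it assumes GNU sed (for \n in the pattern, the bracket expression, and the replacement), uses -E in place of -r, and wraps the script in a hypothetical helper named fix_tags. Like the command above, it relies on the misplaced content starting with <tag3>.

```shell
# Hypothetical helper; the sed script is the one from this answer.
fix_tags() {
  # 1. Replace every <tag1> with a newline, which cannot occur inside a line.
  # 2. Drop a </tag1> that is followed by <tag3>... up to the next marker,
  #    and drop that marker too; [^\n]* cannot run past it.
  # 3. Turn the remaining markers back into <tag1>.
  sed -E 's/<tag1>/\n/g; s/<\/tag1>(<tag3>[^\n]*)\n/\1/g; s/\n/<tag1>/g'
}

input='<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1><tag1>text5<tag4>text6</tag1><tag3>text7</tag3><tag1></tag4>text8</tag1>'

fixed=$(printf '%s\n' "$input" | fix_tags)
printf '%s\n' "$fixed"
```

Because sed reads input line by line, the pattern space never contains a newline of its own, which is what makes \n safe as a temporary marker.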
Upvotes: 1
Reputation: 247162
With GNU sed:
sed '\,<tag1></,{ s,</tag1>,,; s,<tag1>,,2; }' <<END
<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1> <!-- error case -->
<tag1><tag2 /></tag1><tag1><tag3 /></tag1> <!-- should not change -->
END
Output:
<tag1>text1<tag2>text2<tag3>text3</tag3></tag2>text4</tag1> <!-- error case -->
<tag1><tag2 /></tag1><tag1><tag3 /></tag1> <!-- should not change -->
If the string <tag1></ is seen, then remove the first </tag1> and the second <tag1>.
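The two building blocks used here, an address with a custom delimiter (\,regex,) and a numeric flag on s that targets the Nth occurrence, can be seen in isolation. This is a minimal sketch on made-up sample lines, not on the XML itself.

```shell
# Made-up sample lines: only the first one contains "p/q".
out=$(printf '%s\n' 'p/q p/q p/q' 'r r r' |
  sed '\,p/q,{ s,p/q,X,2; }')

# \,p/q,    selects lines matching p/q (comma as delimiter, so "/" needs no escape)
# s,p/q,X,2 replaces only the 2nd occurrence on each selected line
printf '%s\n' "$out"
```

Since the numeric flag counts occurrences per line, the command above fixes one error per line; a line containing two error cases, like the longer example in the question, would need a different approach.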
Upvotes: 0
Reputation: 88899
sed 's|</tag1><tag3>|<tag3>|;s|</tag3><tag1>|</tag3>|' file.xml
Output:
<tag1>text1<tag2>text2<tag3>text3</tag3></tag2>text4</tag1>
Upvotes: 1