allenwang

Reputation: 727

How to escape '<' and '>' in XML tags using a regular expression in Vim?

I am parsing an XML file with Python.

from xml.dom import minidom
xmldoc = minidom.parse('selections.xml')

But when I execute it, an error occurs: xml.parsers.expat.ExpatError: not well-formed (invalid token). After examining the file, I found that there are many stray < and > characters inside the tags. Therefore, I want to escape < and > in the XML tags using a regular expression. For example, in the text tags, I want to escape the < and > around 'Winning 11'.

<writing>
    <topic id="10">I am a fun</topic>
    <date>2012-03-1</date>   
    <grade>86</grade>
    <text>
          You know he is a soccer fan,so you'd better to buy the game is <Winning 11>!
    </text>
</writing>
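
The failure can be reproduced with a shortened, single-line version of the sample above. This is only a minimal sketch: the strings `raw` and `escaped` and the helper `try_parse` are made up for illustration and are not part of my real script.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Shortened stand-ins for the <text> element in the sample (hypothetical data).
raw = "<text>buy the game <Winning 11>!</text>"
escaped = "<text>buy the game &lt;Winning 11&gt;!</text>"


def try_parse(xml_text):
    """Return None if the text parses, otherwise the ExpatError that was raised."""
    try:
        minidom.parseString(xml_text)
        return None
    except ExpatError as exc:
        return exc


raw_error = try_parse(raw)          # the stray <Winning 11> is rejected by expat
escaped_error = try_parse(escaped)  # the escaped version parses cleanly
print(raw_error)
print(escaped_error)
```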

I know that < and > are escaped as &lt; and &gt;. Since there are too many of these tags in my XML file to fix by hand, I want to use a regular expression in Vim to solve it.

Could anyone give me some ideas? I am a newbie with regular expressions.

Upvotes: 1

Views: 1398

Answers (2)

shevski

Reputation: 1022

In detail:

:%s/    #search and replace on all lines in the file
\(      #open group \1
<text>\n #find the <text> tag with a newline at its end
.*      #grab all text up to the next match
\)      #close group \1
<       #the `<` mark we're looking for
\(      #open group \2
.*\n    #grab all text until the end of the line
.*      #grab the text on the next line
<\/text> #find the </text> tag
\)      #close group \2
/       #separator: the replacement follows
\1      #paste group \1 back in
\&lt;   #replace `<` with its escaped version (`\&` is a literal `&`)
\2      #paste group \2 back in
/g      #do this for all occurrences

:%s/\(<text>\n.*\)<\(.*\n.*<\/text>\)/\1\&lt;\2/g

The second one is like the first; I've replaced < with > and &lt; with &gt;:

:%s/\(<text>\n.*\)>\(.*\n.*<\/text>\)/\1\&gt;\2/g

Combine the two with |:

:%s/\(<text>\n.*\)<\(.*\n.*<\/text>\)/\1\&lt;\2/g | %s/\(<text>\n.*\)>\(.*\n.*<\/text>\)/\1\&gt;\2/g
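
Since the file is being read from Python anyway, the same idea can also be sketched there. This is not a translation of the Vim commands above but a related formulation under stated assumptions: it finds each `<text>...</text>` block and escapes every `<` and `>` in its body, assuming `<text>` elements never nest and contain no real child tags. The names `TEXT_BLOCK` and `escape_text_body` are hypothetical.

```python
import re

# Non-greedy match of one <text>...</text> block; DOTALL lets "." span newlines.
TEXT_BLOCK = re.compile(r"(<text>)(.*?)(</text>)", re.DOTALL)


def escape_text_body(xml_text):
    """Escape < and > inside every <text> element, leaving the tags themselves alone."""
    def fix(match):
        body = match.group(2).replace("<", "&lt;").replace(">", "&gt;")
        return match.group(1) + body + match.group(3)
    return TEXT_BLOCK.sub(fix, xml_text)


sample = "<grade>86</grade>\n<text>\n  buy the game <Winning 11>!\n</text>"
result = escape_text_body(sample)
print(result)
```

Unlike the substitute commands, this handles any number of `<` and `>` characters per block in a single pass, at the cost of treating everything inside `<text>` as plain text.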

Reference:
Capturing Groups and Backreferences

Seen without the Vim escaping, the regex for the < part has two groups: the first runs up to the < mark and the second starts right after it.

Upvotes: 2

Karl Barker

Reputation: 11351

Not a good situation to be in, really.

However, if you know the valid xml tags in your file, then the following will match only the 'bad tags' you want to escape:

<(?!/?grade|/?text)([^>]+)>

Add more valid tags to that list in the form |/?tag.

Then you can substitute with

&lt;$1&gt;

Here it is on regexr.

If you need to do this in vim, then you'll need to translate that into vim regex, which isn't quite the same.
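
If running it from Python is acceptable instead, the pattern above can be applied with `re.sub` before parsing. The following is a rough sketch, not a tested solution for the real file: the tag list is taken from the sample in the question, the `\b` after the tag names and the `[^<>]` character class are small additions of mine, and the names `VALID_TAGS`, `BAD_TAG` and `escape_bad_tags` are hypothetical.

```python
import re
from xml.dom import minidom

# Tags that are real markup in the sample; anything else in <...> gets escaped.
VALID_TAGS = ("writing", "topic", "date", "grade", "text")

# Same shape as the pattern above; \b stops "text" from also matching "textual".
BAD_TAG = re.compile(r"<(?!/?(?:%s)\b)([^<>]+)>" % "|".join(VALID_TAGS))


def escape_bad_tags(xml_text):
    """Replace <...> with &lt;...&gt; unless the tag name is in VALID_TAGS."""
    return BAD_TAG.sub(r"&lt;\1&gt;", xml_text)


sample = """<writing>
    <topic id="10">I am a fun</topic>
    <date>2012-03-1</date>
    <grade>86</grade>
    <text>
          You know he is a soccer fan,so you'd better to buy the game is <Winning 11>!
    </text>
</writing>"""

fixed = escape_bad_tags(sample)
doc = minidom.parseString(fixed)  # parses once the stray tag is escaped
text = doc.getElementsByTagName("text")[0].firstChild.data
print(text.strip())
```

This only catches stray `<...>` pairs; a lone `<` or `>` in the text would still need separate handling.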

Upvotes: 0
