Reputation: 727
I am parsing an XML file with Python.
from xml.dom import minidom
xmldoc = minidom.parse('selections.xml')
But when I execute it, this error occurs:
xml.parsers.expat.ExpatError: not well-formed (invalid token)
After examining the file, I found that there are stray < and > characters inside the text of some tags.
Therefore, I want to escape < and > inside the XML tags using a regular expression.
For example, in the text tag below, I want to escape the < and > surrounding 'Winning 11'.
<writing>
<topic id="10">I am a fun</topic>
<date>2012-03-1</date>
<grade>86</grade>
<text>
You know he is a soccer fan,so you'd better to buy the game is <Winning 11>!
</text>
</writing>
I know the escapes for < and > are &lt; and &gt;. Since there are too many such tags in my XML file to fix by hand, I want to solve this with a regular expression in vim.
Could anyone give me some ideas? I am a newbie with regular expressions.
Upvotes: 1
Views: 1398
Reputation: 1022
In detail:
:%s/ #search and replace on all lines in file
\( #open \1 group
<text>\n #find <text> tag with a newline at its end
.* #grab all text until next match
\) #close \1 group
< #the `<` mark we're looking for
\( #open \2 group
.*\n #grab all text until end of line
.* #grab text on the next line
<\/text> #find </text> tag
\) #close \2 group
/ #vi replace with
\1 #paste \1 group in
\&lt; #replace `<` with its escaped version (`\&` is a literal `&` in vim)
\2 #paste \2 group in
/g #Do on all occurrences
:%s/\(<text>\n.*\)<\(.*\n.*<\/text>\)/\1\&lt;\2/g
The second one is like the first; I've replaced < with > and &lt; with &gt;:
:%s/\(<text>\n.*\)>\(.*\n.*<\/text>\)/\1\&gt;\2/g
Combine the two with |:
:%s/\(<text>\n.*\)<\(.*\n.*<\/text>\)/\1\&lt;\2/g | %s/\(<text>\n.*\)>\(.*\n.*<\/text>\)/\1\&gt;\2/g
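Since the question starts from Python, the same idea can also be sketched there before handing the data to minidom. This is only a rough equivalent of the vim substitution, not a drop-in replacement: the names `TEXT_BLOCK` and `escape_text_bodies` are made up for illustration, and it assumes that `<text>` elements carry no attributes, are not nested, and contain no legitimate child elements.

```python
import re
from xml.dom import minidom

# Hypothetical helper names; assumes <text> has no attributes and no real child tags.
TEXT_BLOCK = re.compile(r"(<text>)(.*?)(</text>)", re.DOTALL)


def escape_text_bodies(raw):
    """Escape every < and > found between <text> and </text>."""
    def fix(match):
        body = match.group(2).replace("<", "&lt;").replace(">", "&gt;")
        return match.group(1) + body + match.group(3)
    return TEXT_BLOCK.sub(fix, raw)


# The sample from the question, inlined so the sketch is self-contained.
sample = """<writing>
<topic id="10">I am a fun</topic>
<date>2012-03-1</date>
<grade>86</grade>
<text>
You know he is a soccer fan,so you'd better to buy the game is <Winning 11>!
</text>
</writing>"""

fixed = escape_text_bodies(sample)
doc = minidom.parseString(fixed)  # parses without ExpatError once escaped
text_value = doc.getElementsByTagName("text")[0].firstChild.data
print(text_value)
```

Unlike the vim command, this escapes every < and > inside each block rather than one per substitution pass, so a single call is enough.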
Reference:
Capturing Groups and Backreferences
Regex without vim escaping for the < part; note that the first group runs up to the < mark and the second starts right after it
Upvotes: 2
Reputation: 11351
Not a good situation to be in, really.
However, if you know the valid xml tags in your file, then the following will match only the 'bad tags' you want to escape:
<(?!/?grade|/?text)([^>]+)>
Add more valid tags to that list in the form |/?tag.
Then you can substitute with
&lt;$1&gt;
Here it is on regexr.
If you need to do this in vim, then you'll need to translate that into vim regex, which isn't quite the same.
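If doing the clean-up in Python is an option, the whitelist approach carries over to the `re` module with little change. The following is a minimal sketch under some assumptions: `VALID_TAGS` and `escape_unknown_tags` are illustrative names, the tag list is taken from the sample in the question, and the pattern adds a `\b` word boundary and excludes `<` from the captured text, which goes slightly beyond the regex above.

```python
import re
from xml.dom import minidom

# Assumed whitelist, taken from the sample in the question; extend as needed.
VALID_TAGS = ("writing", "topic", "date", "grade", "text")
alternatives = "|".join(re.escape(tag) for tag in VALID_TAGS)

# Match <...> only when it does NOT start with a known tag name (open or close).
BAD_TAG = re.compile(r"<(?!/?(?:%s)\b)([^<>]+)>" % alternatives)


def escape_unknown_tags(raw):
    """Replace the angle brackets of any non-whitelisted tag with entities."""
    return BAD_TAG.sub(r"&lt;\1&gt;", raw)


sample = """<writing>
<topic id="10">I am a fun</topic>
<date>2012-03-1</date>
<grade>86</grade>
<text>
You know he is a soccer fan,so you'd better to buy the game is <Winning 11>!
</text>
</writing>"""

fixed = escape_unknown_tags(sample)
doc = minidom.parseString(fixed)  # raises ExpatError if anything is still malformed
print(fixed)
```

A lone < or > that is not part of a bracketed pair is left untouched by this pattern, so such cases would still need separate handling.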
Upvotes: 0