Reputation: 257
Extremely amateur programmer here, looking for your help.
I frequently have to edit XML files that look like this:
--- blah blah blah plenty xml stuff above ---
<lex marker="mala" sentiment="negative"/>
<lex marker="malas" sentiment="negative"/>
<lex marker="maleducad\p{Ll}*" sentiment="negative" regex="true"/>
<lex marker="mali\p{Ll}+sima\p{Ll}*" sentiment="negative" regex="true"/>
<lex marker="mali\p{Ll}+simo\p{Ll}*" sentiment="negative" regex="true"/>
--- blah blah blah plenty xml stuff below ---
Using a rather convoluted regex search-and-replace process, I can extract ONLY the value of the marker attribute (that is all I care about).
But it's time-consuming, and there must be a pretty simple way in Python to look for the marker="SOME_TEXT" part of each line, plonk all the values into an array, and afterwards print that array out (to a file). But I can't figure it out :(
I'm looking for a way that doesn't involve importing any kind of XML library, because I want to keep it as simple (and logical) as possible for my amateur programming mind to learn from. I'm only interested in the data from that particular attribute anyway, and I don't care about the rest of the file (or its XML-ness).
I only ask about Python because it's a language I'm keen to get into, but if you can think of a Linux terminal way to do it (sed, awk, etc.), I'm happy to go that route too.
Upvotes: 3
Views: 257
Reputation: 1121794
Matching XML with regular expressions gets too complicated, too fast. You really should not do that.
Use an XML parser instead; Python has several to choose from.
ElementTree example:
from xml.etree import ElementTree

tree = ElementTree.parse('filename.xml')
# iter() finds <lex> elements at any depth, not only direct children of the root
for elem in tree.iter('lex'):
    print(elem.attrib['marker'])
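Since the question also asks for the values to be collected into a list and then written to a file, here is a minimal sketch of that step. The function names `collect_markers` and `write_markers`, the `<lexicon>` wrapper element in the sample, and the output path are assumptions for illustration; they are not taken from the original file.

```python
from xml.etree import ElementTree


def collect_markers(xml_text):
    """Return the marker attribute of every <lex> element, in document order."""
    root = ElementTree.fromstring(xml_text)
    # iter() walks the whole tree, so the nesting depth of <lex> does not matter.
    return [
        elem.get('marker')
        for elem in root.iter('lex')
        if elem.get('marker') is not None
    ]


def write_markers(markers, out_path):
    """Write one marker per line to out_path (hypothetical output file)."""
    with open(out_path, 'w', encoding='utf-8') as out:
        for marker in markers:
            out.write(marker + '\n')


# Hypothetical sample input; the <lexicon> root element is an assumption.
# A raw string keeps the backslashes in the regex-style markers intact.
sample = r"""<lexicon>
  <lex marker="mala" sentiment="negative"/>
  <lex marker="malas" sentiment="negative"/>
  <lex marker="maleducad\p{Ll}*" sentiment="negative" regex="true"/>
</lexicon>"""

markers = collect_markers(sample)
print(markers)
```

For a real file, `ElementTree.parse('filename.xml').getroot()` could replace the `fromstring` call, and `write_markers(markers, 'markers.txt')` would then save the list one value per line.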
Upvotes: 4