aName

Reputation: 257

extract values from xml

Extremely amateur programmer here, looking for your help.

I have to frequently edit xml files that look like this

    --- blah blah blah plenty xml stuff above ---
    <lex marker="mala" sentiment="negative"/>
    <lex marker="malas" sentiment="negative"/>
    <lex marker="maleducad\p{Ll}*" sentiment="negative" regex="true"/>
    <lex marker="mali\p{Ll}+sima\p{Ll}*" sentiment="negative" regex="true"/>
    <lex marker="mali\p{Ll}+simo\p{Ll}*" sentiment="negative" regex="true"/>
    --- blah blah blah plenty xml stuff below ---

Using a rather convoluted regex search-and-replace process, I can extract ONLY the value of the marker attribute (that is all I care about).

But it's time-consuming, and there must be a pretty simple way in Python to look for the attribute marker="SOME_TEXT" part, plonk all the values into an array, and then print out that array (to a file). But I can't figure it out :(

I'm looking for a way that doesn't involve importing any kind of XML library, because I want to keep it as simple (and logical) as possible for my amateur programming mind to learn from. I'm only interested in the data from that particular attribute anyway, and I don't care about the rest of the file (or its XML-ness).

I only ask in Python because it's a language I'm keen to get into, but if you can think of a Linux terminal way to do it (sed, awk, etc.), I'm happy to go that route too.

Upvotes: 3

Views: 257

Answers (1)

Martijn Pieters

Reputation: 1121794

Matching XML with regular expressions gets too complicated, too fast. You really should not do that.

Use an XML parser instead; Python has several to choose from:

  • ElementTree is part of the standard library
  • lxml is a fast and feature-rich C-based library.

ElementTree example:

    from xml.etree import ElementTree

    # Parse the file, then visit every <lex> element at any depth in the tree.
    tree = ElementTree.parse('filename.xml')
    for elem in tree.iter('lex'):
        print(elem.attrib['marker'])
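
Since the question asks for the values to be collected into an array and then written to a file, the ElementTree approach above could be extended roughly as follows. This is only a sketch: the function names `extract_markers` and `write_markers` and the output file name `markers.txt` are made up for illustration, and it assumes the input is well-formed XML in which every element of interest is named `lex`.

```python
from xml.etree import ElementTree


def extract_markers(xml_path):
    """Return the marker attribute of every <lex> element, in document order."""
    tree = ElementTree.parse(xml_path)
    # iter() walks the whole tree, so <lex> elements are found at any depth.
    # Elements without a marker attribute are skipped.
    return [
        elem.get('marker')
        for elem in tree.iter('lex')
        if elem.get('marker') is not None
    ]


def write_markers(markers, out_path):
    """Write one marker per line to out_path (hypothetical output format)."""
    with open(out_path, 'w', encoding='utf-8') as out:
        for marker in markers:
            out.write(marker + '\n')
```

Usage would be along the lines of `write_markers(extract_markers('filename.xml'), 'markers.txt')`. The parser returns attribute values exactly as written in the file, so regex-style markers such as `maleducad\p{Ll}*` come through unchanged.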

Upvotes: 4
