Sergi

Reputation: 417

Parsing XML Python

I am using xml.etree.ElementTree to parse an XML file, but I do not know how to get the plain text that sits between tags.

<Sync time="4.496"/>
<Background time="4.496" type="music" level="high"/>

<Event desc="pause" type="noise" extent="instantaneous"/>
Plain text
<Sync time="7.186"/>

<Event desc="b" type="noise" extent="instantaneous"/>
Plain text
<Sync time="10.949"/>
Plain text

I have this code already:

import xml.etree.ElementTree as etree

data_file = "./file.xml"

# Label printed before the text, keyed by the tag's 'desc' attribute.
LABELS = {
    "es": "ESP:",
    "la": "LATIN:",
    "*": "-ININT-",
    "fs": "-FS-",
    "throat": "-THROAT-",
    "laugh": "-LAUGH-",
}

xmlD = etree.parse(data_file)
root = xmlD.getroot()
# The third child of the root holds the sections.
sections = list(root[2])
for section in sections:
    for turn in section:
        speaker = turn.get('speaker')
        mode = turn.get('mode')

        # Iterate over the Sync/Background/Event tags inside the turn.
        for child in turn:
            time = child.get('time')
            opt = LABELS.get(child.get('desc'), "")

            # tail is None when no text follows the tag.
            print(speaker, mode, time, opt + (child.tail or ""))

I can walk the XML down to the Sync, Background, and Event tags, but I can't extract the text that follows them. I have posted only a piece of the XML file, not the entire file. The only part I have problems with is the final piece of the code.

Thank you so much, @alecxe. Now I can get the info I needed, but I have a new little problem. I get the line through the tail attribute, but a newline character \n (or something similar) seems to be generated before it. I need something like: spk1 planned LAN: Plain text from tail

But I get this:

spk1 planned LAN:
Plain text from tail

I have tried many things (re.match(), sed commands after processing the XML), but it seems there is no \n newline character to remove, and I still can't pull the plain text up onto the same line! Thank you in advance.

Anyone? Thank you!

Upvotes: 2

Views: 379

Answers (1)

alecxe

Reputation: 474181

This is called a tail of an element:

The tail attribute can be used to hold additional data associated with the element. This attribute is usually a string but may be any application-specific object. If the element is created from an XML file the attribute will contain any text found after the element’s end tag and before the next tag.
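
To make the difference between text and tail concrete, here is a minimal sketch. The `<Turn>` wrapper and the fragment itself are assumptions modelled on the sample in the question, not the real file. Note that the tail keeps the line breaks that surround the text in the source:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the <Turn> wrapper is assumed for illustration.
xml_data = '<Turn><Sync time="4.496"/>\nPlain text\n<Sync time="7.186"/></Turn>'

turn = ET.fromstring(xml_data)
first_sync = turn[0]

# An empty tag such as <Sync/> has no text of its own.
print(repr(first_sync.text))
# The text that follows the tag is stored in its tail, whitespace included.
print(repr(first_sync.tail))
```

Because the tail is stored verbatim, a leading line break is expected whenever the text starts on the line after the tag.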

Locate the Event tag and get its tail, for example:

section.find("Event").tail
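
As a rough sketch of how this could be applied to every tag in a turn, the snippet below collects the tails and strips the surrounding whitespace, which should also take care of the line break that shows up before the text. The `SAMPLE` fragment, the `<Turn>` attributes, and the helper name `tail_texts` are assumptions made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment based on the question's sample; the <Turn> wrapper
# and its attributes are assumptions.
SAMPLE = """<Turn speaker="spk1" mode="planned">
<Sync time="4.496"/>
<Background time="4.496" type="music" level="high"/>

<Event desc="pause" type="noise" extent="instantaneous"/>
Plain text one
<Sync time="7.186"/>

<Event desc="b" type="noise" extent="instantaneous"/>
Plain text two
<Sync time="10.949"/>
Plain text three
</Turn>"""


def tail_texts(turn):
    """Return (tag, stripped tail) pairs for children followed by real text."""
    results = []
    for child in turn:
        # tail is None when nothing follows the tag, so fall back to "".
        text = (child.tail or "").strip()
        if text:
            results.append((child.tag, text))
    return results


turn = ET.fromstring(SAMPLE)
for tag, text in tail_texts(turn):
    print(turn.get("speaker"), turn.get("mode"), tag, text)
```

Tags whose tail is only whitespace, such as a Sync followed directly by a Background, are skipped here. Drop the `if text` check if you need a row for every tag.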

Upvotes: 3
