MMM

Reputation: 305

Python xml.etree.ElementTree: remove empty tag in the middle of text

I have an XML document from which I want to extract text based on tags.
The part that I want to extract text from looks something like this:

<BlockText attr1="blah" attr2="657" ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="­"/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>

When I do

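import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
# Collect the distinct tag names, then pick the one containing "BlockText"
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = text.text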

I'm only able to grab the part that comes before the empty tag <TIP CONTENT="­"/>
I tried to delete this tag before getting the rest of the text, like this:

emptyTag = list(filter(lambda i: "TIP" in i, tags))
for e in root.iter(emptyTag) :
    root.remove(e)

But this does not work.
Neither <BlockText> nor <TIP> is a direct child of root.


Thank you.

Upvotes: 0

Views: 1579

Answers (3)

dabingsou

Reputation: 2469

Another solution, for reference only, using the simplified_scrapy library:

from simplified_scrapy import SimplifiedDoc
html = '''
<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="­"/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>
'''
doc = SimplifiedDoc(html)
print(doc.select('BlockText'))
print(doc.select('BlockText>text()'))
print(doc.selects('BlockText>text()'))

Result:

{'tag': 'BlockText', 'attr1': 'blah', 'attr2': '657', 'ID': 'Bhf76', 'lang': 'en', 'html': '\nSimply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="\xad" />\n five centuries, electronic typesetting, remaining essentially release.\n'}
Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.
['Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.']

Upvotes: 0

MMM

Reputation: 305

OK, this is what ended up working for me:

# Tag names containing "TIP" (tags is the list built in the question)
emptyTags = list(filter(lambda i: "TIP" in i, tags))
if emptyTags:
    emptyTag = emptyTags[0]
    for element in root.iter(emptyTag):
        # The text that follows <TIP/> is stored in its tail
        print(element.tail)

But I still can't get the text as a single block, in the original order. I can get all the BlockText tags and all the TIP tags, but not together.

Update: I used itertext(), which yields the text and tail pieces in document order:

tree = ET.parse("myfile.xml")
root = tree.getroot()
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = ''.join(text.itertext())
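
If the TIP elements really have to be deleted (as attempted in the question), here is a rough sketch of one way to do it. ElementTree's remove() only works on the direct parent and also discards the removed element's tail, so the tail has to be moved onto the preceding sibling or the parent first. The helper name strip_elements and the sample XML are made up for illustration, and it assumes the tag has no namespace prefix (otherwise pass the full "{namespace}TIP" name):

```python
import xml.etree.ElementTree as ET

def strip_elements(root, tag):
    """Remove every element named `tag` below root, keeping its tail text in place."""
    for parent in list(root.iter()):
        prev = None  # last surviving sibling seen so far
        for child in list(parent):
            if child.tag == tag:
                tail = child.tail or ""
                if prev is None:
                    # No earlier sibling: the tail continues the parent's text
                    parent.text = (parent.text or "") + tail
                else:
                    # Otherwise it continues the previous sibling's tail
                    prev.tail = (prev.tail or "") + tail
                parent.remove(child)
            else:
                prev = child
    return root

# Hypothetical sample, loosely modeled on the question's snippet
sample = '<doc><BlockText>before<TIP CONTENT="x"/> after<b>bold</b><TIP/> end</BlockText></doc>'
root = strip_elements(ET.fromstring(sample), "TIP")
block = root.find("BlockText")
result = "".join(block.itertext())
print(result)
```

For just reading the text, ''.join(elem.itertext()) as above is simpler; this is only worth it if the cleaned tree is needed afterwards.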

Upvotes: 1

egur

Reputation: 7960

The text after <TIP CONTENT="­"/> belongs to the tail of the TIP element, not to the text of the BlockText tag.

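elem.text is the text between the element's opening tag and its first child (or its closing tag, if it has no children). elem.tail is the text following the element's closing tag, up to the next tag. The tail is usually just whitespace, but in this case it holds actual text.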
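
To make the text/tail split concrete, here is a small sketch. The XML string is a shortened, hypothetical version of the snippet in the question (with a placeholder CONTENT value), not the real file:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's <BlockText> element
xml = '<BlockText ID="Bhf76" lang="en">It has survived not only<TIP CONTENT="x"/> five centuries.</BlockText>'
block = ET.fromstring(xml)
tip = block.find("TIP")

print(repr(block.text))  # text before the first child: 'It has survived not only'
print(repr(tip.text))    # the empty tag has no text of its own: None
print(repr(tip.tail))    # text after the tag: ' five centuries.'

# itertext() walks text and tail pieces in document order
full_text = "".join(block.itertext())
print(full_text)
```

So reading only block.text stops at the first child element, while joining itertext() gives the whole sentence.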

Upvotes: 0
