MMM

Reputation: 305

Extract all text regardless of tags with ElementTree

Edit: I added a root element to the XML (it changes nothing).
I'm using Python 3.7.
I have an XML file named 'f':

<root>
 <page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
 </page>
 <page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
 </page>
</root> 

Edit: this is part of a larger program, and for various reasons 'f' is not a string but an object of type:

<class 'nt.DirEntry'>

I got this object by iterating over the files in a folder with:

for folder in os.scandir(folderPath):

I want to extract every piece of text in that XML, regardless of the tags and how they are nested. The result should be:

Chapter 1
Welcome to Chapter 1
Chapter 2
Welcome to Chapter 2

I tried:

import xml.etree.ElementTree as ET
tree = ET.parse(f)
root = tree.getroot()
root.text #returns nothing
#or
root.tostring() #returns AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'tostring'

and I tried:

tree = ET.fromstring(f)
print(''.join(tree.itertext())) #returns TypeError: a bytes-like object is required, not 'nt.DirEntry'

thank you!

Upvotes: 0

Views: 1129

Answers (2)

Valdi_Bo

Reputation: 30971

Use the following code:

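import xml.etree.ElementTree as et

tree = et.parse('input.xml')
root = tree.getroot()
for it in root.iter():
    # it.text is None for elements without text, so fall back to ''
    txt = (it.text or '').strip()
    if txt:
        print(txt)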

The strip call and the if test filter out elements that have no text or that contain only whitespace.

The output in the other answer contains two empty lines; this approach does not print them.
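As a rough illustration of the strip-and-if filter, here is a minimal sketch that collects the text into a list instead of printing it. The function name element_texts and the empty <note/> element are hypothetical, added only to show an element whose text is None; the XML is parsed from an inline string so the example runs on its own.

```python
import xml.etree.ElementTree as ET

# Inline sample; the <note/> element is a made-up addition with no text.
XML = """<root>
 <page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
  <note/>
 </page>
</root>"""


def element_texts(root):
    """Return the stripped, non-empty .text of every element under root."""
    texts = []
    for elem in root.iter():
        # elem.text is None for an element such as <note/>, so fall back to ''.
        txt = (elem.text or '').strip()
        if txt:
            texts.append(txt)
    return texts


root = ET.fromstring(XML)
print('\n'.join(element_texts(root)))
```

Note that this reads only each element's .text attribute, so text that follows a closing tag (the .tail of an element) is not collected.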

Upvotes: 0

mzjn

Reputation: 50947

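  • You need to provide the filename as a string. In this case, f is an os.DirEntry object whose path is f.path.
  • ET.fromstring() expects XML text (str or bytes), not a file object or a DirEntry, which is why it raises the TypeError.
  • itertext() is a method on Element objects.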

Demo:

import xml.etree.ElementTree as ET

tree = ET.parse(f.path)
root = tree.getroot()
print(''.join(root.itertext())) 

Output:

Chapter 1
Welcome to Chapter 1


Chapter 2
Welcome to Chapter 2
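If the blank lines in that output are unwanted, one possible variation is to strip each chunk returned by itertext() and drop the empty ones. The sketch below is only an illustration under some assumptions: the helper name text_lines, the temporary folder, and the file name sample.xml are all hypothetical stand-ins for the real folder being scanned.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

XML = """<root>
 <page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
 </page>
 <page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
 </page>
</root>"""


def text_lines(path):
    """Parse the XML file at path and return its non-blank text chunks."""
    root = ET.parse(path).getroot()
    # itertext() also yields the whitespace between tags; drop those chunks.
    return [chunk.strip() for chunk in root.itertext() if chunk.strip()]


with tempfile.TemporaryDirectory() as folder_path:
    # Hypothetical sample file standing in for the real folder contents.
    sample = os.path.join(folder_path, 'sample.xml')
    with open(sample, 'w', encoding='utf-8') as fh:
        fh.write(XML)

    results = []
    with os.scandir(folder_path) as entries:
        for entry in entries:
            # entry is an os.DirEntry; pass its path string to the parser.
            results.append(text_lines(entry.path))

print('\n'.join(results[0]))
```

Unlike reading only .text, itertext() also picks up text that follows a closing tag, so nothing is lost when elements are nested with mixed content.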

Upvotes: 1
