Reputation: 6052
I have an XML document that looks like this:
<page>
<textline id="1">
<text>C</text>
<text>o</text>
<text>n</text>
<text>t</text>
<text>a</text>
<text>k</text>
<text>t</text>
</textline>
<textline id="2">
<text>
</text>
</textline>
<textline id="3">
<text>M</text>
<text>e</text>
</textline>
</page>
I am trying to get only the <textline> tags:
from bs4 import BeautifulSoup

with open(path_to_xml_file) as xml_file:
    parsed_xml = BeautifulSoup(xml_file, 'xml')
text_lines = parsed_xml.find_all("textline")
However, text_lines includes all the children of each <textline>, which means it also includes all the <text></text> tags.
I can't find anything in the documentation that explains how to select only the tag itself (and not any children, sub-children, etc.).
I found the recursive=False option, which should select only direct children, so I thought I could apply it to the page tag:
text_lines = parsed_xml.find_all("page", recursive=False)
But that returns an empty list: []
The result I am hoping for is:
<textline id="1"></textline>
<textline id="2"></textline>
<textline id="3"></textline>
Upvotes: 1
Views: 96
Reputation: 6052
I know I originally tagged this question with beautifulsoup, but I wanted to share what I actually ended up using. The solution from @Rakesh does work with BeautifulSoup.
I ended up using Python's built-in XML parser instead:
import xml.etree.ElementTree as ET
tree = ET.parse(path_to_xml_file)
root = tree.getroot()
for textline in root.iter('textline'):
    print(textline)
I think this is a much cleaner solution, so hopefully it helps anyone coming across this post.
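One thing to keep in mind: the elements yielded by root.iter() still carry their <text> children, just as the BeautifulSoup tags do. The sketch below is only an illustration under some assumptions: it uses a shortened inline copy of the XML instead of a file, and the names xml_data, summary, and bare are made up for the example. It reads the tag name and attributes of each <textline>, then builds child-free elements without modifying the parsed tree.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the question's XML (assumption: inline string
# instead of ET.parse(path_to_xml_file)).
xml_data = """<page>
<textline id="1"><text>C</text><text>o</text></textline>
<textline id="2"><text>
</text></textline>
<textline id="3"><text>M</text><text>e</text></textline>
</page>"""

root = ET.fromstring(xml_data)

# tag and attrib describe only the element itself, not its children.
summary = [(el.tag, el.attrib) for el in root.iter("textline")]
print(summary)

# Build new, child-free elements; the parsed tree is left untouched.
bare = [ET.Element(el.tag, dict(el.attrib)) for el in root.iter("textline")]
for el in bare:
    print(ET.tostring(el, encoding="unicode"))
```

If only the ids are needed, el.get("id") inside the loop is enough and no new elements have to be created.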
Upvotes: 0
Reputation: 384
You can use the clear() method to remove all the inner <text> tags from the <textline> tags.
One more thing: you can't pass a file name to BeautifulSoup. You have to pass it the markup itself (or an open file object). Here I keep the XML content in a variable.
myxml = """<page>
<textline id="1">
<text>C</text>
<text>o</text>
<text>n</text>
<text>t</text>
<text>a</text>
<text>k</text>
<text>t</text>
</textline>
<textline id="2">
<text>
</text>
</textline>
<textline id="3">
<text>M</text>
<text>e</text>
</textline>
</page>"""
from bs4 import BeautifulSoup

parsed_xml = BeautifulSoup(myxml, 'xml')
text_lines = parsed_xml.find_all("textline")
# clear() removes every child of the tag, in place
for tl in text_lines:
    tl.clear()
print(text_lines)
Output:
[<textline id="1"/>, <textline id="2"/>, <textline id="3"/>]
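Note that clear() modifies the parsed tree, so the <text> tags are gone from parsed_xml afterwards. If the original tree is still needed, one possible variant (a sketch, not the only way) is to clear a detached copy of each tag instead. The names xml_data and stripped are made up for this example, and the XML is a shortened version of the one above.

```python
import copy

from bs4 import BeautifulSoup

# Shortened sample based on the question's XML (assumption for brevity).
xml_data = """<page>
<textline id="1"><text>C</text><text>o</text></textline>
<textline id="2"><text>
</text></textline>
<textline id="3"><text>M</text><text>e</text></textline>
</page>"""

parsed_xml = BeautifulSoup(xml_data, "xml")

stripped = []
for tl in parsed_xml.find_all("textline"):
    bare = copy.copy(tl)  # copy of the tag that is not attached to the tree
    bare.clear()          # remove the children of the copy only
    stripped.append(bare)

print(stripped)
# The original tree still contains every <text> tag.
print(len(parsed_xml.find_all("text")))
```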
Upvotes: 1
Reputation: 82765
You can set string = '' on each tag, which replaces its contents with an empty string.
Example:
xml = """<page>
<textline id="1">
<text>C</text>
<text>o</text>
<text>n</text>
<text>t</text>
<text>a</text>
<text>k</text>
<text>t</text>
</textline>
<textline id="2">
<text>
</text>
</textline>
<textline id="3">
<text>M</text>
<text>e</text>
</textline>
</page>"""
from bs4 import BeautifulSoup
parsed_xml = BeautifulSoup(xml, 'xml')
text_lines = []
for tag in parsed_xml.find_all("textline"):
    tag.string = ''
    text_lines.append(tag)
print(text_lines)
Output:
[<textline id="1"></textline>,
<textline id="2"></textline>,
<textline id="3"></textline>]
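This output has explicit closing tags, while an approach based on clear() gives self-closing tags. As far as I can tell, the difference is that string = '' leaves one empty string as a child, whereas clear() leaves no children at all. A small sketch to check this (the one-line XML and the names first and second are made up for the example):

```python
from bs4 import BeautifulSoup

# Minimal made-up sample with two <textline> tags.
soup = BeautifulSoup(
    '<page><textline id="1"><text>M</text></textline>'
    '<textline id="2"><text>e</text></textline></page>',
    "xml",
)
first, second = soup.find_all("textline")

first.clear()       # no children remain
second.string = ""  # one empty string child remains

print(len(first.contents), len(second.contents))
print(first.is_empty_element, second.is_empty_element)
print(first, second)
```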
Upvotes: 2