oliverbj

Reputation: 6052

Python3 - Append each character to a string (making a line)

I have an XML element that looks like this:

XML

<page>
    <textline id="1">
        <text>C</text>
        <text>o</text>
        <text>n</text>
        <text>t</text>
        <text>a</text>
        <text>k</text>
        <text>t</text>
    </textline>
    <textline id="2">
        <text>
        </text>
    </textline>
    <textline id="3">
        <text>M</text>
        <text>e</text>
    </textline>
</page>

I am trying to get only the <textline> tags:

from bs4 import BeautifulSoup

with open(path_to_xml_file) as xml_file:
    parsed_xml = BeautifulSoup(xml_file, 'xml')
    text_lines = parsed_xml.find_all("textline")

However, each tag in text_lines still contains its children, which means the result includes all the <text></text> tags.

I can't find anything in the documentation that explains how to select only the tag itself (without any children, sub-children, etc.).

I found the recursive=False option, which should only select direct children, so I thought I could apply this to the page tag:

text_lines = parsed_xml.find_all("page", recursive=False)

But that returns an empty list: []

Expected result:

<textline id="1"></textline>
<textline id="2"></textline>
<textline id="3"></textline>

Upvotes: 1

Views: 96

Answers (3)

oliverbj

Reputation: 6052

I know I originally tagged this question with beautifulsoup, but I wanted to share what I ended up using. The solution from @Rakesh does work with BeautifulSoup.

I actually ended up using Python's built-in XML parser:

import xml.etree.ElementTree as ET

tree = ET.parse(path_to_xml_file)
root = tree.getroot()

for textline in root.iter('textline'):
    print(textline)

I think this is a much cleaner solution, so hopefully it helps anyone coming across this post.
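
Note that print(textline) shows the element's repr rather than markup. If you want output in the shape asked for in the question, one possible approach is to build an empty copy of each <textline> (same tag and attributes, no children) and serialize that. This is only a sketch: the helper name empty_copy is made up for illustration, and the XML is inlined in a string instead of being read from path_to_xml_file.

```python
import xml.etree.ElementTree as ET

# Shortened version of the XML from the question, inlined for the example.
xml_text = """<page>
    <textline id="1">
        <text>C</text>
        <text>o</text>
    </textline>
    <textline id="2">
        <text>
        </text>
    </textline>
    <textline id="3">
        <text>M</text>
    </textline>
</page>"""

root = ET.fromstring(xml_text)


def empty_copy(element):
    """Hypothetical helper: new element with the same tag and attributes, no children or text."""
    return ET.Element(element.tag, dict(element.attrib))


# short_empty_elements=False serializes <textline ...></textline> instead of <textline .../>.
lines = [
    ET.tostring(empty_copy(textline), encoding="unicode", short_empty_elements=False)
    for textline in root.iter("textline")
]

for line in lines:
    print(line)
```

Because it works on copies, the parsed tree itself is left untouched, so the <text> children are still available afterwards if you need them.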

Upvotes: 0

Murali

Reputation: 384

You can use the clear() method to remove all the inner <text> tags from each <textline> tag.

One more thing: you can't pass a file name to BeautifulSoup. You have to open the file and pass its contents (or the open file handle). Here I keep the XML content in a variable.

myxml = """<page>
<textline id="1">
  <text>C</text>
  <text>o</text>
  <text>n</text>
  <text>t</text>
  <text>a</text>
  <text>k</text>
  <text>t</text>
</textline>
<textline id="2">
  <text>
  </text>
</textline>
<textline id="3">
  <text>M</text>
  <text>e</text>
</textline>
</page>"""

from bs4 import BeautifulSoup

parsed_xml = BeautifulSoup(myxml, 'xml')
text_lines = parsed_xml.find_all("textline")
# clear() removes every child of the tag in place
for tl in text_lines:
    tl.clear()

print(text_lines)

Output:

[<textline id="1"/>, <textline id="2"/>, <textline id="3"/>]

Upvotes: 1

Rakesh

Reputation: 82765

You can set each tag's string to an empty string (tag.string = ''), which replaces its contents.

Example:

xml = """<page>
<textline id="1">
  <text>C</text>
  <text>o</text>
  <text>n</text>
  <text>t</text>
  <text>a</text>
  <text>k</text>
  <text>t</text>
</textline>
<textline id="2">
  <text>
  </text>
</textline>
<textline id="3">
  <text>M</text>
  <text>e</text>
</textline>
</page>"""

from bs4 import BeautifulSoup
parsed_xml = BeautifulSoup(xml, 'xml')
text_lines = []
for tag in parsed_xml.find_all("textline"):
    tag.string = ''
    text_lines.append(tag)
print(text_lines)

Output:

[<textline id="1"></textline>,
 <textline id="2"></textline>,
 <textline id="3"></textline>]

Upvotes: 2
