Anna
Anna

Reputation: 409

How to join XML tags under condition (Python)?

I have an XML file structured like this:

<?xml version="1.0" encoding="utf-8" ?>
    <pages>
    <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
    <textbox id="0" bbox="179.739,592.028,261.007,604.510">
    <textline bbox="179.739,592.028,261.007,604.510">
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">C</text>
    <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">A</text>
    <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">P</text>
    <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">I</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">T</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">O</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">L</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">O</text>
    <text> </text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
    <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
    <text>
    </text>
    </textline>
    </textbox>
</page>
</pages>

Note that the file is much bigger than that, and it repeats pages. So I have two kinds of tags, differing in font and font size. I want to merge the letters of the same tags, so I would like an output that keeps the font and font size but also merges what can be merged together, like:

<?xml version="1.0" encoding="utf-8" ?>
        <pages>
        <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <textbox id="0" bbox="179.739,592.028,261.007,604.510">
        <textline bbox="179.739,592.028,261.007,604.510">
        <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">C</text>
        <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">API</text>
        <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">TOLO III</text>
        </textline>
        </textbox>
    </page>
    </pages>

The important thing is that the original order of the letters is kept, along with the tags (so I know which is the font size). So the code so far looks like this:

import xml.etree.ElementTree as ET

MY_XML = ET.parse('fe.xml')

textlines = MY_XML.findall("./page/textbox/textline")

for textline in textlines:
    fulltext = []
    for text_elem in list(textline):
        # Get the text of each 'text' element and then remove it
        fulltext.append(text_elem.text)
        textline.remove(text_elem)

    # Create a new 'text' element and add the joined letters to it
    new_text_elem = ET.Element("text", font="NUMPTY+ImprintMTnum", ncolour="0", size="12.482")
    new_text_elem.text = "".join(fulltext).strip()

    # Append the new 'text' element to its parent
    textline.append(new_text_elem)

print(ET.tostring(MY_XML.getroot(), encoding="unicode"))

But it works only for one tag. I think I would need to put a condition so that the for loop checks for all tags, but I haven't found information on the web on how to do it. How can I include the other tag? Many thanks

Upvotes: 0

Views: 98

Answers (2)

balderman
balderman

Reputation: 23815

Below is the core logic you need to implement.

It is not an End to End solution but it takes care of the core logic.

import xml.etree.ElementTree as ET


xml = '''<?xml version="1.0" encoding="utf-8" ?>
<pages>
    <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
        <textbox id="0" bbox="179.739,592.028,261.007,604.510">
            <textline bbox="179.739,592.028,261.007,604.510">
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">C</text>
                <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">A</text>
                <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">P</text>
                <text font="NUMPTY+ImprintMTnum-it"  ncolour="0" size="12.333">I</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">T</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">O</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">L</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">O</text>
                <text> </text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
                <text font="NUMPTY+ImprintMTnum"  ncolour="0" size="12.482">I</text>
                <text>
                </text>
            </textline>
        </textbox>
    </page>
</pages>'''
words = []
root = ET.fromstring(xml)
pages = root.findall('.//page')
for page in pages:
    previous_key = None
    current_key = None
    texts = page.findall('.//text')
    for txt in texts:
        if previous_key:
            current_key = (txt.attrib.get('font',previous_key[0]),txt.attrib.get('size',previous_key[1]))
        else:
            current_key = (txt.attrib.get('font','empty'),txt.attrib.get('size','empty'))
        if current_key != previous_key:
            words.append([])
        words[-1].append(txt.text)
        previous_key = current_key

for group in words:
    if group:
        print(''.join(group))

output

C
API
TOLO III

Upvotes: 1

Joe
Joe

Reputation: 7131

Why is new_text_elem a hardcoded Element, with fixed attributes? You don't know which attributes to assign.

Try the following. Create another inner for loop that writes ALL tags to a dictionary. You can iterate over tags as well.

For the next element check if all tags are in the dictionary and if they are the same. Read about dictionary comparison or just iterate over the keys and compare with ==.

If they are the same add the element to a list of identical elements you found so far. Then check the next element.

If they are not the same add all elements of the list as a new element, combining the text. Then clear the list and start over.

Summary:

  • You iterate through the <text> Elements.
  • Store all consecutive <text> Elements have the same tags in a list.
  • Storing them one after another in a list preserves the order.
  • Once you encounter the first different <text> Element, write the stored ones first, concatenating their text, using their stored tags and values.
  • Clear the store list and repeat.

Upvotes: 1

Related Questions