Valentin Macé
Valentin Macé

Reputation: 1272

Tag unrecognized during iterparsing using lxml

I have a really weird problem with lxml, I try to parse my xml file with iterparse as follow:

for event, elem in etree.iterparse(input_file, events=('start', 'end')):
    if elem.tag == 'tuv' and event == 'start':
        if elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en':
            if elem.find('seg') is not None:
                write_in_some_file
        elif elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'de':
            if elem.find('seg') is not None:
                write_in_some_file

It is pretty simple and works almost perfectly, shortly it goes through my xml file, if an elem is it checks if the language attribute is 'en' or 'de', it then checks if the got a child, if yes it writes its value into a file

There is ONE < seg > in the file that seems not existing, returning None when doing elem.find('seg'), you can see it here and you have it in its context below <seg>! keine Spalten und Ventile</seg>.

I don't understand why this tag which seems perfectly fine creates a problem (since I can't use its .text), note that every other tag is find well

<tu tuid="235084307" datatype="Text">
<prop type="score">1.67647</prop>
<prop type="score-zipporah">0.6683</prop>
<prop type="score-bicleaner">0.7813</prop>
<prop type="lengthRatio">0.740740740741</prop>
<tuv xml:lang="en">
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
 <seg>! no gaps and valves</seg>
</tuv>
<tuv xml:lang="de">
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
 <seg>! keine Spalten und Ventile</seg>
</tuv>
</tu>

Upvotes: 0

Views: 291

Answers (2)

Daniel Haley
Daniel Haley

Reputation: 52878

In the lxml docs there is this warning:

WARNING: During the 'start' event, any content of the element, such as the descendants, following siblings or text, is not yet available and should not be accessed. Only attributes are guaranteed to be set.

Maybe instead of using find() from tu to get the seg element, change your "if" statement to match seg and the "end" event.

You can use getparent() to get the xml:lang attribute value from the parent tu.

Example ("test.xml" with an additional "tu" element for testing)

<tus>
    <tu tuid="235084307" datatype="Text">
        <prop type="score">1.67647</prop>
        <prop type="score-zipporah">0.6683</prop>
        <prop type="score-bicleaner">0.7813</prop>
        <prop type="lengthRatio">0.740740740741</prop>
        <tuv xml:lang="en">
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
            <seg>! no gaps and valves</seg>
        </tuv>
        <tuv xml:lang="de">
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
            <seg>! keine Spalten und Ventile</seg>
        </tuv>
    </tu>
    <tu tuid="235084307A" datatype="Text">
        <prop type="score">1.67647</prop>
        <prop type="score-zipporah">0.6683</prop>
        <prop type="score-bicleaner">0.7813</prop>
        <prop type="lengthRatio">0.740740740741</prop>
        <tuv xml:lang="en">
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
            <seg>! no gaps and valves #2</seg>
        </tuv>
        <tuv xml:lang="de">
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
            <seg>! keine Spalten und Ventile #2</seg>
        </tuv>
    </tu>
</tus>

Python 3.x

from lxml import etree

for event, elem in etree.iterparse("test.xml", events=("start", "end")):

    if elem.tag == "seg" and event == "end":
        current_lang = elem.getparent().get("{http://www.w3.org/XML/1998/namespace}lang")
        if current_lang == "en":
            print(f"Writing en text \"{elem.text}\" to file...")
        elif current_lang == "de":
            print(f"Writing de text \"{elem.text}\" to file...")
        else:
            print(f"Unable to determine language. Not writing \"{elem.text}\" to any file.")

    if event == "end":
        elem.clear()

Printed Output

Writing en text "! no gaps and valves" to file...
Writing de text "! keine Spalten und Ventile" to file...
Writing en text "! no gaps and valves #2" to file...
Writing de text "! keine Spalten und Ventile #2" to file...

Upvotes: 1

Jack Fleeting
Jack Fleeting

Reputation: 24928

I'm not sure if this is what you're looking (I'm pretty new to this myself), but

for event, elem in etree.iterparse('xml_try.txt', events=('start', 'end')):
if elem.tag == 'tuv' and event == 'start':
    if elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en':
        if elem.find('seg') is not None:
            print(elem[2].text)
    elif elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'de':
        if elem.find('seg') is not None:
            print(elem[2].text)

Generates this output:

! no gaps and valves
! keine Spalten und Ventile

Again, apologies if this isn't what you're after.

Upvotes: 1

Related Questions