Tag unrecognized during iterparsing using lxml

Question

I have a really weird problem with lxml. I am trying to parse my XML file with iterparse as follows:

from lxml import etree

for event, elem in etree.iterparse(input_file, events=('start', 'end')):
    if elem.tag == 'tuv' and event == 'start':
        if elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en':
            if elem.find('seg') is not None:
                pass  # write the seg text to the English output file
        elif elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'de':
            if elem.find('seg') is not None:
                pass  # write the seg text to the German output file

It is pretty simple and works almost perfectly. In short, it goes through my XML file; if an element is a <tuv>, it checks whether the language attribute is 'en' or 'de', then checks whether the <tuv> has a <seg> child, and if so writes its value into a file.

There is ONE <seg> in the file that seems not to exist: elem.find('seg') returns None for it. Here it is, with its context below: <seg>! keine Spalten und Ventile</seg>

I don't understand why this tag, which seems perfectly fine, causes a problem (I can't use its .text). Note that every other tag is found without issue.


1.67647
0.6683
0.7813
0.740740740741

 http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html
 http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html
 ! no gaps and valves


 http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html
 http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html
 ! keine Spalten und Ventile

Daniel Haley · Accepted Answer

In the lxml docs there is this warning:

WARNING: During the 'start' event, any content of the element, such as the descendants, following siblings or text, is not yet available and should not be accessed. Only attributes are guaranteed to be set.

Instead of calling find() on tuv during the "start" event to get the seg element, change your "if" statement to match seg on the "end" event, when its text has been fully parsed.

You can use getparent() to get the xml:lang attribute value from the parent tuv.
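
Another way to respect the warning quoted above, closer to the loop in the question, is to keep matching tuv but act only on its "end" event, by which point its seg child has been parsed. This is a minimal sketch, not a definitive implementation: the inline sample, its body root element, and the collect_segments helper are assumptions made for illustration, and the import falls back to the standard library's ElementTree (whose iterparse accepts the same events argument) if lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library used in the question
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample: only tu/tuv/seg and xml:lang come from the question;
# the <body> root element is an assumption.
SAMPLE = (
    b"<body><tu>"
    b'<tuv xml:lang="en"><seg>! no gaps and valves</seg></tuv>'
    b'<tuv xml:lang="de"><seg>! keine Spalten und Ventile</seg></tuv>'
    b"</tu></body>"
)


def collect_segments(source, wanted=("en", "de")):
    """Return (lang, text) pairs, reading each tuv only once it is complete."""
    found = []
    for event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != "tuv":
            continue
        lang = elem.get(XML_LANG)
        seg = elem.find("seg")  # children are fully parsed at the 'end' event
        if lang in wanted and seg is not None:
            found.append((lang, seg.text))
        elem.clear()  # release the finished tuv subtree
    return found


segments = collect_segments(io.BytesIO(SAMPLE))
print(segments)
```

Only tuv elements are cleared here, so a seg is never emptied before its parent has been read.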

Example ("test.xml" with an additional "tu" element for testing)


    
        1.67647
        0.6683
        0.7813
        0.740740740741
        
            http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html
            http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html
            ! no gaps and valves
        
        
            http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html
            http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html
            ! keine Spalten und Ventile
        
    
    
        1.67647
        0.6683
        0.7813
        0.740740740741
        
            http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html
            http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html
            ! no gaps and valves #2
        
        
            http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html
            http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html
            ! keine Spalten und Ventile #2

Python 3.x

from lxml import etree

for event, elem in etree.iterparse("test.xml", events=("start", "end")):

    if elem.tag == "seg" and event == "end":
        current_lang = elem.getparent().get("{http://www.w3.org/XML/1998/namespace}lang")
        if current_lang == "en":
            print(f"Writing en text \"{elem.text}\" to file...")
        elif current_lang == "de":
            print(f"Writing de text \"{elem.text}\" to file...")
        else:
            print(f"Unable to determine language. Not writing \"{elem.text}\" to any file.")

    if event == "end":
        elem.clear()

Printed Output

Writing en text "! no gaps and valves" to file...
Writing de text "! keine Spalten und Ventile" to file...
Writing en text "! no gaps and valves #2" to file...
Writing de text "! keine Spalten und Ventile #2" to file...
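
The print calls above stand in for the real file writes. As a hedged sketch of that last step, the function below remembers the language when a tuv "start" event fires (attributes are guaranteed at that point) and writes the text when the seg "end" event arrives, so getparent() is not needed. The write_segments name, the outputs mapping, and the inline sample with its body root element are hypothetical; the import falls back to the standard library's ElementTree if lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def write_segments(source, outputs):
    """Write each seg's text to outputs[lang]; return the number of lines written."""
    current_lang = None
    written = 0
    for event, elem in etree.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == "tuv":
                current_lang = elem.get(XML_LANG)  # attributes are set at 'start'
            continue
        # From here on the event is 'end', so text and children are available.
        if elem.tag == "seg" and current_lang in outputs and elem.text:
            outputs[current_lang].write(elem.text + "\n")
            written += 1
        elif elem.tag == "tuv":
            current_lang = None  # leaving this tuv
        elem.clear()  # free memory for every finished element
    return written


# Hypothetical sample; StringIO objects stand in for the two output files.
sample = io.BytesIO(
    b"<body><tu>"
    b'<tuv xml:lang="en"><seg>! no gaps and valves</seg></tuv>'
    b'<tuv xml:lang="de"><seg>! keine Spalten und Ventile</seg></tuv>'
    b"</tu></body>"
)
outputs = {"en": io.StringIO(), "de": io.StringIO()}
count = write_segments(sample, outputs)
print(count, repr(outputs["de"].getvalue()))
```

In real use the StringIO objects would be replaced by files opened for writing, one per language.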
