Reputation: 1272
I have a really weird problem with lxml, I try to parse my xml file with iterparse as follow:
for event, elem in etree.iterparse(input_file, events=('start', 'end')):
if elem.tag == 'tuv' and event == 'start':
if elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en':
if elem.find('seg') is not None:
write_in_some_file
elif elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'de':
if elem.find('seg') is not None:
write_in_some_file
It is pretty simple and works almost perfectly, shortly it goes through my xml file, if an elem is it checks if the language attribute is 'en' or 'de', it then checks if the got a child, if yes it writes its value into a file
There is ONE < seg > in the file that seems not existing, returning None when doing elem.find('seg'), you can see it here and you have it in its context below <seg>! keine Spalten und Ventile</seg>
.
I don't understand why this tag which seems perfectly fine creates a problem (since I can't use its .text), note that every other tag is find well
<tu tuid="235084307" datatype="Text">
<prop type="score">1.67647</prop>
<prop type="score-zipporah">0.6683</prop>
<prop type="score-bicleaner">0.7813</prop>
<prop type="lengthRatio">0.740740740741</prop>
<tuv xml:lang="en">
<prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
<prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
<seg>! no gaps and valves</seg>
</tuv>
<tuv xml:lang="de">
<prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
<prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
<seg>! keine Spalten und Ventile</seg>
</tuv>
</tu>
Upvotes: 0
Views: 291
Reputation: 52878
In the lxml docs there is this warning:
WARNING: During the 'start' event, any content of the element, such as the descendants, following siblings or text, is not yet available and should not be accessed. Only attributes are guaranteed to be set.
Maybe instead of using find()
from tu
to get the seg
element, change your "if" statement to match seg
and the "end" event.
You can use getparent()
to get the xml:lang
attribute value from the parent tu
.
Example ("test.xml" with an additional "tu" element for testing)
<tus>
<tu tuid="235084307" datatype="Text">
<prop type="score">1.67647</prop>
<prop type="score-zipporah">0.6683</prop>
<prop type="score-bicleaner">0.7813</prop>
<prop type="lengthRatio">0.740740740741</prop>
<tuv xml:lang="en">
<prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
<prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
<seg>! no gaps and valves</seg>
</tuv>
<tuv xml:lang="de">
<prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
<prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
<seg>! keine Spalten und Ventile</seg>
</tuv>
</tu>
<tu tuid="235084307A" datatype="Text">
<prop type="score">1.67647</prop>
<prop type="score-zipporah">0.6683</prop>
<prop type="score-bicleaner">0.7813</prop>
<prop type="lengthRatio">0.740740740741</prop>
<tuv xml:lang="en">
<prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
<prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
<seg>! no gaps and valves #2</seg>
</tuv>
<tuv xml:lang="de">
<prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
<prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
<seg>! keine Spalten und Ventile #2</seg>
</tuv>
</tu>
</tus>
Python 3.x
from lxml import etree
for event, elem in etree.iterparse("test.xml", events=("start", "end")):
if elem.tag == "seg" and event == "end":
current_lang = elem.getparent().get("{http://www.w3.org/XML/1998/namespace}lang")
if current_lang == "en":
print(f"Writing en text \"{elem.text}\" to file...")
elif current_lang == "de":
print(f"Writing de text \"{elem.text}\" to file...")
else:
print(f"Unable to determine language. Not writing \"{elem.text}\" to any file.")
if event == "end":
elem.clear()
Printed Output
Writing en text "! no gaps and valves" to file...
Writing de text "! keine Spalten und Ventile" to file...
Writing en text "! no gaps and valves #2" to file...
Writing de text "! keine Spalten und Ventile #2" to file...
Upvotes: 1
Reputation: 24928
I'm not sure if this is what you're looking (I'm pretty new to this myself), but
for event, elem in etree.iterparse('xml_try.txt', events=('start', 'end')):
if elem.tag == 'tuv' and event == 'start':
if elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en':
if elem.find('seg') is not None:
print(elem[2].text)
elif elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'de':
if elem.find('seg') is not None:
print(elem[2].text)
Generates this output:
! no gaps and valves
! keine Spalten und Ventile
Again, apologies if this isn't what you're after.
Upvotes: 1