Reputation: 37
<div1 class="tag1">
<div2 class="tag2">
<div3 class="tag3">no</div3>
yes
</div2>
</div1>
I want to parse div1 and get its own text, if it has any, so for div1 I want to keep {name_class: tag1 (or None), text: None}.
Then I want to repeat this for every nested element: {name_class: tag2, text: yes}, {name_class: tag3, text: no}
My code to resolve this problem:
from pyquery import PyQuery as pq
a = '<div><div>no</div>yes</div>'
tryy = pq(a)[0]
tmp = [{"text" : tryy.text, "class" : pq(tryy).attr('class')}]
tmp + parse_rec(a)
Here type(tryy) is lxml.etree._Element.
The problem is that lxml.etree._Element.text does not include the "yes" contained in div2.
I also tried the approach from "Only extracting text from this element, not its children" with bs4, but it does not work.
Any solution is welcome, whatever the library.
Upvotes: 1
Views: 482
Reputation: 16828
Based on the documentation, the text "yes" is considered the tail of the element div3, not part of the text of div2. Using your sample XML, the following code:
from lxml import etree
root = etree.parse("sample.xml")
for element in root.iter():
    # text and tail are None when absent, so fall back to an empty string
    text = (element.text or "").strip()
    tail = (element.tail or "").strip()
    print(f"{text}, {element.get('class')}, {tail}")
Outputs:
, tag1,
, tag2,
no, tag3, yes
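To get the list of dictionaries asked for in the question, each child's tail can be credited back to its parent. The following is only a minimal sketch, not a definitive solution: it assumes the sample markup from the question, uses the standard-library xml.etree.ElementTree (which shares the text/tail model with lxml.etree), and the helper names own_text and collect are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup copied from the question.
SAMPLE = """<div1 class="tag1">
<div2 class="tag2">
<div3 class="tag3">no</div3>
yes
</div2>
</div1>"""


def own_text(element):
    """Return the text that belongs directly to `element`, or None if empty."""
    # Text before the first child is stored in element.text; text that
    # follows a child is stored in that child's .tail, so gather both.
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    joined = " ".join(p.strip() for p in parts if p.strip())
    return joined or None


def collect(root):
    """Describe every element in document order (hypothetical helper)."""
    return [
        {"name_class": el.get("class"), "text": own_text(el)}
        for el in root.iter()
    ]


records = collect(ET.fromstring(SAMPLE))
for record in records:
    print(record)
```

With the sample above this should give text None for tag1, "yes" for tag2 and "no" for tag3. Whitespace-only text is treated as missing, which may or may not be what you want for other documents.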
Upvotes: 1