Parse an HTML element using pyquery, BeautifulSoup, or an alternative library

Question


  
<div class="tag1">
  <div class="tag2">
    <div class="tag3">no</div>
    yes
  </div>
</div>

I want to parse div1 and get its own text, if it has any, recording {name_class: tag1, text: None}. Then I want to repeat this for each nested element: {name_class: tag2, text: yes}, {name_class: tag3, text: no}.

My attempt so far:

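from pyquery import PyQuery as pq

# Sample markup: the "yes" text follows the closing tag of the tag3 div
a = '<div class="tag1"><div class="tag2"><div class="tag3">no</div>yes</div></div>'
tryy = pq(a)[0]  # underlying lxml element for the outer div

tmp = [{"text": tryy.text, "class": pq(tryy).attr('class')}]
# parse_rec is not defined in this snippet; it stands for the recursive step
tmp + parse_rec(a)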

type(tryy) is lxml.etree._Element. The problem is that lxml.etree._Element.text does not include the "yes" contained in div2.

I also tried the approach from "Only extracting text from this element, not its children", but it does not work with bs4.

Solutions using any library are welcome.

Garett · Accepted Answer

Based on the documentation, the text "yes" is considered the tail of the element div3, not the text of div2. Using your sample XML, the following code:

from lxml import etree

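root = etree.parse("sample.xml")  # the sample markup saved to a file

# iter() walks every element in document order (getiterator() is its deprecated alias)
for element in root.iter():
    text = (element.text or "").strip()  # .text is None when an element has no leading text
    tail = (element.tail or "").strip()  # .tail is the text that follows the closing tag
    print(f"{text}, {element.attrib['class']}, {tail}")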

Outputs:

, tag1, 
, tag2, 
no, tag3, yes
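
To get the list of dicts asked for in the question, one option is to give each child's tail back to its parent. The following is only a sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree, which shares the .text/.tail model with lxml, so it needs well-formed markup (for real-world HTML, lxml.html.fromstring could be swapped in). The names own_text and parse_rec are hypothetical helpers, with parse_rec borrowed from the question's snippet.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<div class="tag1">
  <div class="tag2">
    <div class="tag3">no</div>
    yes
  </div>
</div>
"""


def own_text(element):
    """Return the text directly inside element, without its descendants' text."""
    parts = [element.text or ""]
    # Text after a child's closing tag is stored in child.tail,
    # but it sits inside the parent, so it is counted here.
    parts.extend(child.tail or "" for child in element)
    text = " ".join(part.strip() for part in parts if part.strip())
    return text or None


def parse_rec(element):
    """Hypothetical recursive helper: one dict per element, in document order."""
    result = [{"name_class": element.get("class"), "text": own_text(element)}]
    for child in element:
        result.extend(parse_rec(child))
    return result


rows = parse_rec(ET.fromstring(SAMPLE.strip()))
for row in rows:
    print(row)
```

With the sample above this prints one dict per div: tag1 with text None, tag2 with text "yes", and tag3 with text "no".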
