Nathan monnier

Reputation: 37

Parse an HTML element using pyquery, BeautifulSoup, or an alternative library

<div1 class="tag1">
  <div2 class="tag2">
    <div3 class="tag3">no</div3>
    yes
  </div2>
</div1>

I want to parse div1 and get its own text, if it has any, recording {name_class: tag1 (or None), text: None}. Then I want to repeat this for each nested element: {name_class: tag2, text: yes}, {name_class: tag3, text: no}

My attempt at solving this problem:

from pyquery import PyQuery as pq

a = '<div><div>no</div>yes</div>'
tryy = pq(a)[0]

tmp = [{"text" : tryy.text, "class" : pq(tryy).attr('class')}]
tmp + parse_rec(a)

type(tryy) is lxml.etree._Element. The problem is that lxml.etree._Element.text does not include the "yes" contained in div2.

I also tried the bs4 approach from "Only extracting text from this element, not its children", but it does not work.

Solutions using any library are welcome.

Upvotes: 1

Views: 482

Answers (1)

Garett

Reputation: 16828

Based on the documentation, the text "yes" is considered the tail of the element div3, not the text of div2. Using your sample XML, the following code:

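from lxml import etree

root = etree.parse("sample.xml")

# iter() walks the tree in document order; getiterator() is deprecated
for element in root.iter():
    text = (element.text or "").strip()  # text before the first child
    tail = (element.tail or "").strip()  # text after the element's closing tag
    print(f"{text}, {element.get('class')}, {tail}")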

Outputs:

, tag1, 
, tag2, 
no, tag3, yes
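
To get the records asked for in the question, one option is to treat each child's tail as part of its parent's own text. The following is only a sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml (the two share the same text/tail model), the names own_text and parse_rec are hypothetical helpers rather than part of any library, and text fragments are joined with a single space.

```python
import xml.etree.ElementTree as ET

# The sample markup from the question, inlined so the sketch is self-contained.
SAMPLE = """<div1 class="tag1">
  <div2 class="tag2">
    <div3 class="tag3">no</div3>
    yes
  </div2>
</div1>"""


def own_text(element):
    """Return the text that belongs directly to element, or None if empty."""
    # element.text is the text before the first child; each child's tail is
    # text that sits inside this element, right after that child closes.
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    joined = " ".join(part.strip() for part in parts if part.strip())
    return joined or None


def parse_rec(element):
    """Collect one record per element, in document order."""
    return [
        {"name_class": node.get("class"), "text": own_text(node)}
        for node in element.iter()
    ]


records = parse_rec(ET.fromstring(SAMPLE))
for record in records:
    print(record)
```

With the sample above, div2 is reported with the text "yes" because that string is the tail of its child div3. The same two helpers should work on an lxml element as well, since lxml elements also expose text, tail, get and iter, but markup that contains comments or processing instructions would need extra filtering there.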

Upvotes: 1
