vinay_Kumar

Reputation: 117

Parsing all the text inside a tag using lxml in Python

I am trying to parse an HTML file that looks roughly like this:

<ol>
  <li>
    <div class="c1">
      <span class="s1">hi</span>
      " hello "
      <span class="s2">world!</span>
    </div>
  </li>
  <li>
    <div class="c2">
      <span class="s3">abc</span>
      " def ghijkl "
      <span class="s1">mno</span>
      " pqr!"
    </div>
  </li>
</ol>

I tried parsing it with the following code:

from lxml import html

# `code` is the response object that holds the downloaded page
tree = html.fromstring(code.content)
sol = tree.xpath('//ol//text()')
for x in sol:
    print(x)

This prints each text node on its own line:

hi
 hello 
world!
abc
 def ghijkl
mno
 pqr!

How can I get all the text inside each <li> tag on a single line? That is, I want the output to be:

hi hello world!
abc def ghijkl mno pqr!

Upvotes: 0

Views: 462

Answers (2)

Padraic Cunningham

Reputation: 180391

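You can select each li and apply XPath's normalize-space() to it, which takes the element's concatenated text, trims it, and collapses runs of whitespace into single spaces: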

from lxml import html
h = """<ol>
  <li>
    <div class="c1">
      <span class="s1">hi</span>
      " hello "
      <span class="s2">world!</span>
    </div>
  </li>
  <li>
    <div class="c2">
      <span class="s3">abc</span>
      " def ghijkl "
      <span class="s1">mno</span>
      " pqr!"
    </div>
  </li>
</ol>"""


tree = html.fromstring(h)

for li in tree.xpath("//ol/li"):
    print(li.xpath("normalize-space(.)"))

Which gives you:

hi " hello " world!
abc " def ghijkl " mno " pqr!"
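
If the double quotes are literally present in the markup (as they are in the sample string here) and you want the output from the question without them, one possible variation is to strip them with XPath's translate() before normalizing whitespace. This is only a minimal sketch: the helper name li_lines and the condensed SAMPLE string are made up for illustration, and it removes every double quote in the text, including legitimate ones.

```python
from lxml import html

# Condensed version of the markup from the question (illustrative only).
SAMPLE = """<ol>
  <li><div class="c1"><span class="s1">hi</span> " hello " <span class="s2">world!</span></div></li>
  <li><div class="c2"><span class="s3">abc</span> " def ghijkl " <span class="s1">mno</span> " pqr!"</div></li>
</ol>"""


def li_lines(markup):
    """Return one whitespace-normalized string per <li>, with double quotes removed."""
    tree = html.fromstring(markup)
    # translate(., '"', '') deletes every double-quote character from the
    # element's string value; normalize-space() then trims and collapses whitespace.
    expr = "normalize-space(translate(., '\"', ''))"
    return [li.xpath(expr) for li in tree.xpath("//ol/li")]


for line in li_lines(SAMPLE):
    print(line)
```

On the sample this should print "hi hello world!" and "abc def ghijkl mno pqr!". If the quotes only come from how a browser inspector displays text nodes, plain normalize-space(.) is already enough.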

Upvotes: 1

Nehal J Wani

Reputation: 16619

$ cat a.py
from lxml import etree

xml = """<ol>
  <li>
    <div class="c1">
      <span class="s1">hi</span>
      " hello "
      <span class="s2">world!</span>
    </div>
  </li>
  <li>
    <div class="c2">
      <span class="s3">abc</span>
      " def ghijkl "
      <span class="s1">mno</span>
      " pqr!"
    </div>
  </li>
</ol>"""

tree = etree.fromstring(xml)
sol = tree.xpath('//ol//li')
for a in sol:
    # Join the non-empty, stripped text pieces of each <li> with single spaces
    print(" ".join(t.strip() for t in a.itertext() if t.strip()))

$ python a.py
hi " hello " world!
abc " def ghijkl " mno " pqr!"
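
If you parse with lxml.html rather than lxml.etree, a similar result can probably be had from the element's text_content() method, with str.split() and " ".join() taking care of the whitespace. A rough sketch, assuming the input is HTML; the helper name li_text_lines and the condensed SAMPLE string are hypothetical and only for illustration:

```python
from lxml import html

# Condensed version of the markup from the question (illustrative only).
SAMPLE = """<ol>
  <li><div class="c1"><span class="s1">hi</span> " hello " <span class="s2">world!</span></div></li>
  <li><div class="c2"><span class="s3">abc</span> " def ghijkl " <span class="s1">mno</span> " pqr!"</div></li>
</ol>"""


def li_text_lines(markup):
    """Return the text of each <li> as one line with single spaces between words."""
    tree = html.fromstring(markup)
    # text_content() returns all text inside the element, children included;
    # split() with no arguments drops every run of whitespace, newlines too.
    return [" ".join(li.text_content().split()) for li in tree.iter("li")]


for line in li_text_lines(SAMPLE):
    print(line)
```

Note that text_content() is provided by lxml.html elements; plain lxml.etree elements do not have it, so itertext() is the tool to use there.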

Upvotes: 1
