How to parse a htmlpage with lxml with <br /> screwing up?

Question

I want to parse the following piece of HTML from NASA's website with lxml in Python:

    
        Launch Date:1981-09-24

        Launch Vehicle: Delta

        Launch Site: Cape Canaveral, United States

        Mass: 550.0 kg

I am using the following Python 3 code:

from lxml.html import parse

page = parse("http://nssdc.gsfc.nasa.gov/nmc/spacecraftDisplay.do?id=1981-096A")

rows = page.xpath('//div[@class="urtwo"]/p')[0]
for element in rows:
    print(element.xpath("string()"))

But the values after the labels come out empty:

Launch Date:

Launch Vehicle:

Launch Site:

Mass:

I think it has something to do with </strong> or <br />.

Can anyone help me to find the solution?

alecxe · Accepted Answer

How about iterating over the strong tags, treating each one as a label and taking the text node that follows it as the value:

rows = page.xpath('//div[@class="urtwo"]/p//strong')
for element in rows:
    label = element.text.strip()
    value = element.xpath("following-sibling::text()")[0].strip()

    print(label, value)

Prints:

('Launch Date:', u'1981-09-24')
(u'Launch\xa0Vehicle:', u'Delta')
(u'Launch\xa0Site:', u'Cape Canaveral, United States')
('Mass:', u'550.0\xa0kg')
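For anyone wondering why the original loop printed only the labels: lxml keeps the text that comes after a closing tag in that element's .tail, not in the element itself, so string() on a strong element returns just the label. Here is a minimal, self-contained sketch of the same idea using .tail instead of the following-sibling XPath. The SAMPLE markup is an assumption reconstructed from the question's description (the real page may differ), and clean() and extract_fields() are hypothetical helper names, not part of lxml.

```python
from lxml import html

# Hypothetical markup modeled on the question; the real NASA page may differ.
SAMPLE = """
<div class="urtwo">
  <p>
    <strong>Launch Date:</strong> 1981-09-24<br />
    <strong>Launch&nbsp;Vehicle:</strong> Delta<br />
    <strong>Launch&nbsp;Site:</strong> Cape Canaveral, United States<br />
    <strong>Mass:</strong> 550.0&nbsp;kg
  </p>
</div>
"""


def clean(text):
    """Replace non-breaking spaces and trim surrounding whitespace."""
    return (text or "").replace("\xa0", " ").strip()


def extract_fields(root):
    """Map each <strong> label to the text that directly follows it."""
    fields = {}
    for strong in root.xpath('//div[@class="urtwo"]/p//strong'):
        label = clean(strong.text_content()).rstrip(":")
        # The text after </strong> (up to the next tag) lives in .tail.
        fields[label] = clean(strong.tail)
    return fields


root = html.fromstring(SAMPLE)
fields = extract_fields(root)
for label, value in fields.items():
    print(f"{label}: {value}")
```

One difference from the XPath version: following-sibling::text() with [0] picks the first text node anywhere after the label, so a label with no value would pick up text belonging to a later row, while .tail only ever holds the text immediately after the closing tag. The replace("\xa0", " ") step also gets rid of the non-breaking spaces visible in the output above.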
