Reputation: 99
I want to parse the following piece of HTML from NASA's website with lxml in Python:
<p>
<strong>Launch Date:</strong>1981-09-24<br/>
<strong>Launch Vehicle:</strong> Delta<br/>
<strong>Launch Site:</strong> Cape Canaveral, United States<br/>
<strong>Mass:</strong> 550.0 kg<br/>
</p>
I am using the following Python 3 code:
from lxml.html import parse
page = parse("http://nssdc.gsfc.nasa.gov/nmc/spacecraftDisplay.do?id=1981-096A")
rows = page.xpath('//div[@class="urtwo"]/p')[0]
for element in rows:
    print(element.xpath("string()"))
But the values after the labels are missing from the output:
Launch Date:
Launch Vehicle:
Launch Site:
Mass:
I think it has something to do with </strong> or <br/>.
Can anyone help me find a solution?
Upvotes: 2
Views: 100
Reputation: 474191
How about iterating over the <strong> tags, treating each one as a label and taking the first text sibling that follows it as the value:
rows = page.xpath('//div[@class="urtwo"]/p//strong')
for element in rows:
    label = element.text.strip()
    value = element.xpath("following-sibling::text()")[0].strip()
    print(label, value)
Prints:
('Launch Date:', u'1981-09-24')
(u'Launch\xa0Vehicle:', u'Delta')
(u'Launch\xa0Site:', u'Cape Canaveral, United States')
('Mass:', u'550.0\xa0kg')
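As for why the original loop printed only the labels: iterating over the <p> element yields its child elements (<strong> and <br/>), while the values are bare text nodes sitting between them. lxml stores such text on the preceding element's .tail attribute, so reading .tail is another way to get the values. Below is a minimal sketch that runs against the snippet from the question rather than the live page. The extract_fields helper and the SNIPPET constant are names made up for this illustration, the ElementTree fallback is only there so the sketch runs without lxml installed, and the \xa0 replacement assumes you want the non-breaking spaces seen in the output above turned into ordinary spaces.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Standard-library fallback; it shares the same .text/.tail model.
    import xml.etree.ElementTree as etree

# The HTML fragment from the question (well-formed, so an XML parser accepts it).
SNIPPET = """<p>
<strong>Launch Date:</strong>1981-09-24<br/>
<strong>Launch Vehicle:</strong> Delta<br/>
<strong>Launch Site:</strong> Cape Canaveral, United States<br/>
<strong>Mass:</strong> 550.0 kg<br/>
</p>"""


def extract_fields(paragraph):
    """Map each <strong> label to the text that directly follows it."""
    fields = {}
    for strong in paragraph.iter("strong"):
        # .text is the label inside <strong>; .tail is the text after </strong>.
        label = (strong.text or "").replace("\xa0", " ").strip().rstrip(":")
        value = (strong.tail or "").replace("\xa0", " ").strip()
        fields[label] = value
    return fields


fields = extract_fields(etree.fromstring(SNIPPET))
for label, value in fields.items():
    print(label, "->", value)
```

With the real page, the same helper should work on the element returned by page.xpath('//div[@class="urtwo"]/p')[0], provided the markup there still matches the snippet in the question.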
Upvotes: 1