pri0ritize

Reputation: 594

Python and lxml.html get_element_by_id output questions

I'm trying to extract data from an HTML file. The code I'm using runs, but not as I expect: I can get some elements but not all of them, and I'm wondering whether that has to do with the size of the file I'm reading.

I'm currently trying to parse the source of this webpage.

The page is 4500 lines long, so it is fairly large. I've been using it because I'd like to make sure the code works on large files.

The code I'm using is:

import lxml.html
import urllib2

webHTML = urllib2.urlopen('http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html').read()
webHTML = lxml.html.fromstring(webHTML)
productDetails = webHTML.get_element_by_id('productDetails')
for element in productDetails:
    print element.text_content()

This gives the expected output when I use an element id near the top of the page, such as 'mm3', but with the id 'productDetails' I get no output, at least on my current setup.

Upvotes: 1

Views: 1969

Answers (1)

alecxe

Reputation: 474031

I'm afraid lxml.html cannot parse this particular HTML source correctly. Even in its default "recover" mode, it parses the h3 tag with id="productDetails" as an empty element:

<h3 class="productDescription2" id="productDetails" itemprop="description"></h3>
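One way to check what lxml actually built for an element is to serialize it back to markup and count its children. A minimal sketch, assuming Python 3; the `describe_element` helper and the `sample` markup are made up for illustration and are not taken from the product page:

```python
import lxml.html


def describe_element(html_source, element_id):
    """Return (serialized markup, number of direct children) for an element."""
    root = lxml.html.fromstring(html_source)
    # get_element_by_id raises KeyError when no element has this id.
    element = root.get_element_by_id(element_id)
    markup = lxml.html.tostring(element, encoding="unicode")
    return markup, len(element)


# Hypothetical sample markup, not the real page source.
sample = '<div><div id="productDetails"><p>Hello</p></div></div>'
markup, child_count = describe_element(sample, "productDetails")
print(markup)
print(child_count)
```

If the child count is 0 and the markup is an empty tag, a loop such as `for element in productDetails` has nothing to iterate over, which would explain the missing output.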

Switch to BeautifulSoup with the html5lib parser, which is extremely lenient:

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html'
soup = BeautifulSoup(urlopen(url), 'html5lib')

for element in soup.find(id='productDetails').find_all():
    print element.text

Prints:

Looking for the ultimate power system for your next Multi-rotor project? Look no further!The Turnigy Multistar outrunners are designed with one thing in mind - maximising Multi-rotor performance! They feature high-end magnets, high quality bearings and all are precision balanced for smooth running, these motors are engineered specifically for multi-rotor use.These include a prop adapter and have a built in aluminium mount for quick and easy installation on your multi-rotor frame.

outrunner

...
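For readers on Python 3, the same extraction could look roughly like the sketch below. It works on an HTML string rather than a live download, so the `sample` markup and the `child_texts` helper are hypothetical. The demo call passes `parser="html.parser"` only so that it runs without html5lib installed; for the real page you would keep the html5lib default suggested above and fetch the source with `urllib.request.urlopen` in place of `urllib2.urlopen`.

```python
from bs4 import BeautifulSoup


def child_texts(html_source, element_id, parser="html5lib"):
    """Return the text of every descendant tag of the element with this id."""
    soup = BeautifulSoup(html_source, parser)
    container = soup.find(id=element_id)
    if container is None:
        # No element with this id was found in the parsed tree.
        return []
    return [tag.get_text() for tag in container.find_all()]


# Hypothetical sample markup, not the real page source.
sample = '<h3 id="productDetails"><p>First</p><p>Second</p></h3>'
for text in child_texts(sample, "productDetails", parser="html.parser"):
    print(text)
```

Returning an empty list for a missing id is a design choice of this sketch; the original snippet would instead raise an AttributeError when `soup.find` returns None.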

Upvotes: 1
