Reputation: 594
I'm trying to extract data from an HTML file. The code I'm using runs, but not as I expect: I can get some elements but not all of them, and I'm wondering whether that has to do with the size of the file I'm reading.
I'm currently trying to parse the source of this webpage.
The page is 4500 lines long, so it is a pretty good size. I've been using it because I'd like to make sure the code works on large files.
The code I'm using is:
import urllib2

import lxml.html

url = 'http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html'
# Download the raw page source, then parse it into an element tree.
rawHTML = urllib2.urlopen(url).read()
webHTML = lxml.html.fromstring(rawHTML)

# Iterating an element yields only its direct child elements.
productDetails = webHTML.get_element_by_id('productDetails')
for element in productDetails:
    print element.text_content()
This gives the expected output when I use an id such as 'mm3', or that of another element near the top of the page, but with the id 'productDetails' I get no output. At least, that is what happens on my current setup.
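One thing that may help narrow this down: looping over an lxml element visits only its direct child elements, so an element that the parser built without children prints nothing in such a loop, whatever the file size. The sketch below shows that behaviour on a small inline document. `SAMPLE` and its ids are made up for illustration and are not taken from the page in question; it is written for Python 3.

```python
import lxml.html

# Hypothetical sample markup, not the real page.
SAMPLE = """
<html><body>
  <div id="text_only">just text, no child tags</div>
  <div id="with_children"><p>one</p><p>two</p></div>
</body></html>
"""

doc = lxml.html.fromstring(SAMPLE)

text_only = doc.get_element_by_id("text_only")
with_children = doc.get_element_by_id("with_children")

# len(element) is the number of direct child elements.
child_counts = (len(text_only), len(with_children))

# A for loop over an element walks exactly those children.
child_texts = [child.text_content() for child in with_children]

# text_content() on the element itself still returns its own text.
own_text = text_only.text_content()

print(child_counts)
print(child_texts)
print(own_text)
```

If `len(productDetails)` turns out to be 0 on the real page, the loop has nothing to iterate over, which points at how the element was parsed rather than at the length of the file.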
Upvotes: 1
Views: 1969
Reputation: 474031
I'm afraid lxml.html cannot handle parsing this particular HTML source. It parses the h3 tag with id="productDetails" as an empty element (and this is in the default "recover" mode):
<h3 class="productDescription2" id="productDetails" itemprop="description"></h3>
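A rough way to run this kind of check yourself is to serialize the element that lxml built and look at it. The sketch below assumes Python 3 and uses a hypothetical inline `SAMPLE` that simply mirrors the empty tag above; on the real page, `doc` would come from the downloaded source instead.

```python
import lxml.html

# Hypothetical stand-in for the parsed page; it mirrors the empty tag shown above.
SAMPLE = (
    '<html><body>'
    '<h3 class="productDescription2" id="productDetails" itemprop="description"></h3>'
    '</body></html>'
)

doc = lxml.html.fromstring(SAMPLE)
element = doc.get_element_by_id("productDetails")

# Serialize only this element (without any trailing text) to see what the parser built.
markup = lxml.html.tostring(element, encoding="unicode", with_tail=False)

# An element with no children and no text of its own is effectively empty.
has_content = bool(len(element) or (element.text or "").strip())

print(markup)
print("has content:", has_content)
```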
Switch to BeautifulSoup with the html5lib parser (it is extremely lenient):
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html'
soup = BeautifulSoup(urlopen(url), 'html5lib')
# find_all() with no arguments returns every descendant tag of the element.
for element in soup.find(id='productDetails').find_all():
    print element.text
Prints:
Looking for the ultimate power system for your next Multi-rotor project? Look no further!The Turnigy Multistar outrunners are designed with one thing in mind - maximising Multi-rotor performance! They feature high-end magnets, high quality bearings and all are precision balanced for smooth running, these motors are engineered specifically for multi-rotor use.These include a prop adapter and have a built in aluminium mount for quick and easy installation on your multi-rotor frame.
outrunner
...
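For anyone on Python 3, where urllib2 no longer exists, the same approach could look roughly like the sketch below. The helper names `descendant_texts` and `fetch` are made up for this example, and the `parser` argument is an addition so that a different Beautiful Soup parser can be tried; html5lib stays the default and has to be installed separately.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def descendant_texts(html, element_id, parser="html5lib"):
    """Return the text of every descendant tag of the element with this id.

    Hypothetical helper: returns an empty list when the id is not found.
    """
    soup = BeautifulSoup(html, parser)
    target = soup.find(id=element_id)
    if target is None:
        return []
    # find_all() with no arguments yields all descendant tags in document order.
    return [tag.get_text() for tag in target.find_all()]


def fetch(url):
    """Download a page and return its raw bytes (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()
```

With these in place, `descendant_texts(fetch(url), 'productDetails')` would return the texts that the loop above prints, provided the page is still reachable at that URL.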
Upvotes: 1