Python iterate through section using lxml

Question

I have a webpage that I am currently parsing with BeautifulSoup, but it is quite slow, so I decided to try lxml, which I have read is very fast.

I am struggling to get my code to iterate over only the section I want. I am not sure how to use lxml, and I can't find clear documentation for it.

Here is my code:

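import urllib2
from lxml import etree

def wgetUrl(target):
    # Fetch the page and return its body, or '' if the request fails.
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        try:
            outtxt = response.read()
        finally:
            # Always release the connection, even if read() fails.
            response.close()
    except IOError:
        # urllib2.URLError and socket errors are IOError subclasses.
        return ''
    return outtxt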

newUrl = 'http://www.tv3.ie/3player'

data = wgetUrl(newUrl)
parser = etree.HTMLParser()
tree   = etree.fromstring(data, parser)

for elem in tree.iter("div"):
    print elem.tag, elem.attrib, elem.text

This returns all the divs, but how do I iterate only over the div with id='slider1'?

div {'style': 'position: relative;', 'id': 'slider1'} None

This does not work:

for elem in tree.iter("slider1"):

I know this is probably a dumb question, but I can't figure it out.

Thanks!

** EDIT **

With your help, I added this code and now get the output below:

for elem in tree.xpath("//div[@id='slider1']//div[@id='gridshow']"):
    print elem[0].tag, elem[0].attrib, elem[0].text
    print elem[1].tag, elem[1].attrib, elem[1].text
    print elem[2].tag, elem[2].attrib, elem[2].text
    print elem[3].tag, elem[3].attrib, elem[3].text
    print elem[4].tag, elem[4].attrib, elem[4].text

Output:

a {'href': '/3player/show/392/57922/1/Tallafornia', 'title': '3player | Tallafornia, 11/01/2013. The Tallafornia crew are back, living in a beachside villa in Santa Ponsa, Majorca. As the crew settle in, the egos grow bigger than ever and cause tension'} None
h3 {} None
span {'id': 'gridcaption'} The Tallafornia crew are back, living in a beachside vill...
span {'id': 'griddate'} 11/01/2013
span {'id': 'gridduration'} 00:27:52

That is all brilliant, but I am missing part of the a tag above. Is the parser not handling the markup correctly?

I'm not getting the following:

Any ideas why it doesn't pull this?

Thanks again, these posts were very helpful.

isedev · Accepted Answer

You can use an XPath expression as follows:

for elem in tree.xpath("//div[@id='slider1']"):

Example:

>>> import urllib2
>>> import lxml.etree
>>> url = 'http://www.tv3.ie/3player'
>>> data = urllib2.urlopen(url)
>>> parser = lxml.etree.HTMLParser()
>>> tree = lxml.etree.parse(data,parser)
>>> elem = tree.xpath("//div[@id='slider1']")
>>> elem[0].attrib
{'style': 'position: relative;', 'id': 'slider1'}
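
To see why tree.iter("slider1") finds nothing while the XPath expression works, here is a minimal sketch that runs against an inline HTML string instead of the live page. The SAMPLE_HTML markup is a made-up stand-in, not the real tv3.ie page. iter() filters on tag name only, so an id has to be matched with an attribute predicate such as [@id='slider1'].

```python
from lxml import etree

# Hypothetical stand-in for the real page, so the example runs offline.
SAMPLE_HTML = """
<html><body>
  <div id="header">Header</div>
  <div id="slider1" style="position: relative;">
    <div id="gridshow"><a href="/3player/show/1/Example">Example</a></div>
  </div>
</body></html>
"""

tree = etree.fromstring(SAMPLE_HTML, etree.HTMLParser())

# iter() filters by tag name, so this visits every div in the document.
by_tag = [el.get("id") for el in tree.iter("div")]

# "slider1" is an id value, not a tag name, so nothing matches.
by_wrong_tag = list(tree.iter("slider1"))

# XPath can filter on the attribute; xpath() returns a list of elements.
slider = tree.xpath("//div[@id='slider1']")[0]

# iter() on the matched element is limited to that subtree
# (the element itself is included).
inner_ids = [el.get("id") for el in slider.iter("div")]

print(by_tag)
print(by_wrong_tag)
print(inner_ids)
```

Once the slider1 element has been selected, calling iter() or xpath() on it keeps the rest of the loop confined to that section of the page.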

You need to better analyse the contents of the page you are processing (a good way to do this is to use Firefox with the Firebug add-on).

The img tag you are trying to obtain is actually a child of the a tag:

>>> for elem in tree.xpath("//div[@id='slider1']//div[@id='gridshow']"):
...    for elem_a in elem.xpath("./a"):
...       for elem_img in elem_a.xpath("./img"):
...          print ' HREF=%s'%(elem_a.attrib['href'])
...          print ' ALT="%s"'%(elem_img.attrib['alt'])
 HREF=/3player/show/392/58784/1/Tallafornia
 ALT="3player | Tallafornia, 01/02/2013. A fresh romance blossoms in the Tallafornia house. Marc challenges Cormac to a 'bench off' in the gym"
 HREF=/3player/show/46/58765/1/Coronation-Street
 ALT="3player | Coronation Street, 01/02/2013. Tyrone bumps into Kirsty in the street and tries to take Ruby from her pram"
../..
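
The nested loops above can also be collapsed into one XPath expression that selects the img elements directly, using getparent() to reach the enclosing a element. This is only a sketch under assumptions: SAMPLE_HTML and extract_shows are hypothetical names, and the markup merely imitates the structure described in the question. It also suggests why elem[0].text printed None for the a element: in lxml, .text holds only the text that comes before the first child, and here the first child is the img.

```python
from lxml import etree

# Hypothetical markup imitating the structure described above.
SAMPLE_HTML = """
<div id="slider1">
  <div id="gridshow">
    <a href="/3player/show/1/Show-One"><img src="one.jpg" alt="Show One, episode 1"/></a>
    <h3>Show One</h3>
  </div>
  <div id="gridshow">
    <a href="/3player/show/2/Show-Two"><img src="two.jpg"/></a>
  </div>
</div>
"""

tree = etree.fromstring(SAMPLE_HTML, etree.HTMLParser())


def extract_shows(tree):
    """Return (href, alt) pairs for every linked image inside slider1."""
    shows = []
    path = "//div[@id='slider1']//div[@id='gridshow']/a/img"
    for img in tree.xpath(path):
        link = img.getparent()  # the enclosing <a> element
        # get() returns None for a missing attribute instead of raising
        # KeyError the way attrib['alt'] would.
        shows.append((link.get("href"), img.get("alt")))
    return shows


shows = extract_shows(tree)
for href, alt in shows:
    print("HREF=%s ALT=%s" % (href, alt))

# The <a> element has no text of its own because its first child is <img>.
first_link = tree.xpath("//div[@id='gridshow']/a")[0]
print(first_link.text)
```

Using get() rather than attrib[...] is a deliberate choice here: real pages often have images without an alt attribute, and a None value is easier to handle than an exception in the middle of the loop.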
