Tim Holdsworth

Reputation: 499

How to get an attribute value with lxml on html

I have some HTML that I want to parse with lxml using Python. There are a number of elements on the page that each represent a poster. I want to grab each poster's ID, so that I can then scrape a piece of information off the poster's page. Currently the poster's id is stored in the id attribute, so I want to use lxml to get the value of that attribute.

For example:

<div onclick="showDetail(9202)">
    <div class="maincard narrower Poster" id="maincard_9202"> </div>
</div>

I want to grab the "maincard_9202" from the id attribute, so that I can then use a regex to extract the 9202. From there, I can use this value to go directly to the poster's page, since I know that the URL pattern goes from

https://nips.cc/Conferences/2017/Schedule?type=Poster (current page) to https://nips.cc/Conferences/2017/Schedule?showEvent=9202 (poster page)

I was trying to use the following code:

from lxml import html
import requests
page = requests.get('https://nips.cc/Conferences/2017/Schedule?type=Poster')
tree = html.fromstring(page.content)
paper_numbers = tree.xpath('//div[@onclick]/id/')

but this returns an empty list.

How can I get the attribute value in this case?

Upvotes: 2

Views: 962

Answers (1)

ewcz

Reputation: 13097

paper_numbers = tree.xpath('//div[@onclick]/div/@id')
print(paper_numbers)

For the sample HTML in the question, this prints

['maincard_9202']

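It selects the id attribute of every div that is a direct child of a div carrying an onclick attribute. In XPath, attributes are addressed with an @ prefix; a bare id step looks for a child element named id instead, which is why the original expression found nothing.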
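
To get from the attribute values to the poster pages described in the question, something like the following might work. It is only a sketch: it parses the sample markup from the question rather than fetching the live page, and it assumes every id ends in the numeric poster ID. The names SAMPLE, BASE, card_ids, poster_numbers and poster_urls are made up for illustration.

```python
import re
from lxml import html

# Sample markup from the question, wrapped in a minimal document.
# Assumption: the live schedule page uses the same structure.
SAMPLE = """
<html><body>
<div onclick="showDetail(9202)">
    <div class="maincard narrower Poster" id="maincard_9202"> </div>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Select the id attribute of each div directly inside a div with onclick.
card_ids = tree.xpath('//div[@onclick]/div/@id')

# Pull the trailing digits out of each id (assumed format: maincard_<number>).
poster_numbers = []
for card_id in card_ids:
    match = re.search(r'(\d+)$', card_id)
    if match:
        poster_numbers.append(match.group(1))

# URL pattern taken from the question.
BASE = 'https://nips.cc/Conferences/2017/Schedule?showEvent='
poster_urls = [BASE + number for number in poster_numbers]

print(poster_urls)
# ['https://nips.cc/Conferences/2017/Schedule?showEvent=9202']
```

For the real page, SAMPLE would be replaced by page.content from the requests call in the question; ids that do not end in digits are simply skipped.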

Upvotes: 5
