Reputation: 499
I have some HTML that I want to parse with lxml using Python. There are a number of elements on the page that each represent a poster. I want to grab each poster's ID, so that I can then scrape a piece of information off the poster's page. Currently the poster's id is stored in the id attribute, so I want to use lxml to get the value of that attribute.
For example:
<div onclick="showDetail(9202)">
<div class="maincard narrower Poster" id="maincard_9202"> </div>
</div>
I want to grab the "maincard_9202" from the id attribute, so that I can then use a regex to extract the 9202. From there, I can use this value to go directly to the poster's page, since I know that the URL pattern goes from
https://nips.cc/Conferences/2017/Schedule?type=Poster (current page) to https://nips.cc/Conferences/2017/Schedule?showEvent=9202 (poster page)
I was trying to use the following code:
from lxml import html
import requests
page = requests.get('https://nips.cc/Conferences/2017/Schedule?type=Poster')
tree = html.fromstring(page.content)
paper_numbers = tree.xpath('//div[@onclick]/id/')
but this returns an empty list.
How can I get the attribute value in this case?
Upvotes: 2
Views: 962
Reputation: 13097
paper_numbers = tree.xpath('//div[@onclick]/div/@id')
print(paper_numbers)
would give you
['maincard_9202']
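It selects the id attributes of all div elements that are direct children of a div with the onclick attribute. The original query found nothing because /id looks for child elements named id; attributes are selected with @id.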
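To go from the attribute value to the poster page, something along these lines might work. It is only a sketch: it parses the sample markup from the question rather than fetching the live page, the names SAMPLE and poster_urls are made up for illustration, and it assumes every id ends in the numeric poster ID.

```python
import re
from lxml import html

# Sample markup copied from the question; a real run would pass page.content instead.
SAMPLE = '''
<div onclick="showDetail(9202)">
<div class="maincard narrower Poster" id="maincard_9202"> </div>
</div>
'''

# URL prefix taken from the poster-page pattern given in the question.
POSTER_URL = 'https://nips.cc/Conferences/2017/Schedule?showEvent='


def poster_urls(markup):
    """Return poster page URLs for every id found under a div with onclick."""
    tree = html.fromstring(markup)
    ids = tree.xpath('//div[@onclick]/div/@id')
    urls = []
    for value in ids:
        # Assumption: the poster number is the run of digits at the end of the id.
        match = re.search(r'(\d+)$', value)
        if match:
            urls.append(POSTER_URL + match.group(1))
    return urls


urls = poster_urls(SAMPLE)
print(urls)
```

For the sample above this should yield a single URL ending in showEvent=9202. Ids that do not end in digits are skipped rather than raising an error.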
Upvotes: 5