Reputation: 3215
I am using the following code to try to return a list of tickers from the Wikipedia page below, but the result is an empty list. I copied the XPath from the Chromium developer tools. What am I doing wrong?
from lxml import html
import requests
url = 'http://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
resp = requests.get(url)
tree = html.fromstring(resp.text)
tickers = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tbody/tr[1]/td[1]/a')
print(tickers)
Upvotes: 1
Views: 460
Reputation: 1121396
Browsers add in missing HTML elements that the HTML specification states are part of the model. lxml does not add those in.
The most common such element is the <tbody> element. Your document has no such element, but Chrome adds one and puts it in your XPath. Another such element is the <thead> element; again, the original HTML is lacking it, but Chrome added it and put the one <tr> row with <th> elements in it.
As such the correct XPath expression is:
tickers = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')
i.e. the second row in the table, first table cell in that row.
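To see the effect without touching the network, here is a minimal sketch on a made-up table. The SAMPLE markup and the example.com links are hypothetical stand-ins for the Wikipedia page, not its actual HTML; the only point is that lxml leaves the tree as written, so a path through tbody finds nothing.

```python
from lxml import html

# Hypothetical stand-in for the page: the source has no <thead> or <tbody>.
SAMPLE = """
<div id="mw-content-text">
  <table>
    <tr><th>Ticker</th><th>Name</th></tr>
    <tr><td><a href="https://example.com/MMM">MMM</a></td><td>3M</td></tr>
    <tr><td><a href="https://example.com/ABT">ABT</a></td><td>Abbott</td></tr>
  </table>
</div>
"""

tree = html.fromstring(SAMPLE)

# XPath as copied from a browser: goes through a <tbody> that lxml never inserts.
with_tbody = tree.xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[1]/td[1]/a')

# XPath matching the markup as written: second row, first cell.
without_tbody = tree.xpath('//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')

print(with_tbody)                        # empty list
print([a.text for a in without_tbody])   # text of the one matching link
```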
Note that lxml can load URLs directly; you don't really need to use requests in this specific case:
>>> from lxml import html
>>> url = 'http://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
>>> tree = html.parse(url)
>>> tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')
[<Element a at 0x10445e628>]
>>> tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')[0].text
'MMM'
>>> tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')[0].attrib['href']
'https://www.nyse.com/quote/XNYS:MMM'
If you wanted to extract all <a> elements in that first column, you'd have to remove the positional restriction on the <tr> element; drop the [1] from your XPath (the [2] in the corrected one) so that it selects every row:
links = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr/td[1]/a')
for link in links:
    print(link.text, link.attrib['href'])
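If you would rather not depend on whether a tbody is present at all, one option is to use the descendant axis (//tr) below the table. This is only a sketch on a hypothetical snippet (the SAMPLE markup and example.com links are invented for illustration), and note that //tr would also pick up rows of any nested tables.

```python
from lxml import html

# Hypothetical markup, not the real Wikipedia page.
SAMPLE = """
<div id="mw-content-text">
  <table>
    <tr><th>Ticker</th><th>Name</th></tr>
    <tr><td><a href="https://example.com/MMM">MMM</a></td><td>3M</td></tr>
    <tr><td><a href="https://example.com/ABT">ABT</a></td><td>Abbott</td></tr>
  </table>
</div>
"""

tree = html.fromstring(SAMPLE)

# //tr matches rows at any depth under the table, with or without <tbody>.
# The header row is skipped naturally because it holds <th>, not <td>, cells.
links = tree.xpath('//*[@id="mw-content-text"]/table[1]//tr/td[1]/a')

# Collect (ticker, href) pairs; .get() returns None if href is missing.
tickers = [(link.text, link.get('href')) for link in links]

for ticker, href in tickers:
    print(ticker, href)
```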
Upvotes: 2