Aran Freel

Reputation: 3215

Python 3.4 : LXML web scraping

I am using the following code to try to return a list of tickers from the website below. The result is an empty list. I copied the XPath from the Chromium developer tools. What am I doing wrong?

from lxml import html
import requests


url = 'http://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

resp = requests.get(url)
tree = html.fromstring(resp.text)

tickers = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tbody/tr[1]/td[1]/a')

print(tickers)

Upvotes: 1

Views: 460

Answers (1)

Martijn Pieters

Reputation: 1121396

Browsers add in missing HTML elements that the HTML specification states are part of the model. lxml does not add those in.

The most common such element is the <tbody> element. The page source has no such element, but Chrome adds one and includes it in the XPath it gives you. Another such element is <thead>; again, the original HTML lacks it, but Chrome adds it and places the one <tr> row with <th> elements inside it.

As such the correct XPath expression is:

tickers = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')

That is, the second row in the table, and the first table cell in that row.
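To see the difference without hitting the network, here is a minimal sketch on an inline document. The markup is a made-up stand-in for the Wikipedia page (same id, one header row, one data row, no <tbody>), not the real page source:

```python
from lxml import html

# Hypothetical stand-in for the real page: note there is no <tbody> or <thead>.
RAW = """
<html><body><div id="mw-content-text">
<table>
<tr><th>Ticker</th><th>Security</th></tr>
<tr><td><a href="https://www.nyse.com/quote/XNYS:MMM">MMM</a></td><td>3M</td></tr>
</table>
</div></body></html>
"""

tree = html.fromstring(RAW)

# XPath as copied from the browser: it expects a <tbody> that lxml never adds.
with_tbody = tree.xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[1]/td[1]/a')

# XPath matching the markup as served: the header row is tr[1], data starts at tr[2].
without_tbody = tree.xpath('//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')

print(with_tbody)                        # []
print([a.text for a in without_tbody])   # ['MMM']
```

The first expression comes back empty because the parsed tree has no <tbody> step to match; the second follows the structure that is actually in the source.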

Note that lxml can load URLs directly; you don't really need to use requests in this specific case:

>>> from lxml import html
>>> url = 'http://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
>>> tree = html.parse(url)
>>> tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')
[<Element a at 0x10445e628>]
>>> tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')[0].text
'MMM'
>>> tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr[2]/td[1]/a')[0].attrib['href']
'https://www.nyse.com/quote/XNYS:MMM'

If you wanted to extract all <a> elements in that first column, you'd have to remove the restriction on the <tr> element. Your XPath picks a single row; remove the positional predicate ([1] in your expression, [2] in the corrected one) to select all rows:

links = tree.xpath(r'//*[@id="mw-content-text"]/table[1]/tr/td[1]/a')
for link in links:
    print(link.text, link.attrib['href'])
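If you would rather not depend on whether the source happens to include <tbody>, one option is the descendant axis (//tr) under the table. The sketch below is only an illustration: both documents and the first_column_links helper are hypothetical, and //tr would also match rows of any nested table, so check it against the real page before relying on it.

```python
from lxml import html

# Hypothetical markup: the same table, once without <tbody> and once with it.
ROWS = (
    '<tr><th>Ticker</th></tr>'
    '<tr><td><a href="/mmm">MMM</a></td></tr>'
    '<tr><td><a href="/abt">ABT</a></td></tr>'
)
PAGE = '<html><body><div id="mw-content-text"><table>{}</table></div></body></html>'
NO_TBODY = PAGE.format(ROWS)
WITH_TBODY = PAGE.format('<tbody>' + ROWS + '</tbody>')

# //tr matches rows at any depth below the table, so an optional <tbody> is skipped over.
XPATH = '//*[@id="mw-content-text"]/table[1]//tr/td[1]/a'


def first_column_links(markup):
    """Return (text, href) pairs for the links in the first column."""
    tree = html.fromstring(markup)
    return [(a.text, a.get('href')) for a in tree.xpath(XPATH)]


print(first_column_links(NO_TBODY))
print(first_column_links(WITH_TBODY))
```

Both calls should give the same two links, since header rows contain <th> cells only and so never match td[1].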

Upvotes: 2
