TheOneTrueSign

Reputation: 149

Web scraping with XPath in Python

The method for deriving an XPath expression to scrape a page is not clear to me. I'm trying to use XPath in Python to scrape the Wikipedia article on states and capitals to get a list of states and a list of capitals, but so far I haven't been able to work out the correct path. I've tried inspecting the element in the browser and copying the XPath from there, but that hasn't worked either. Can someone explain a method for working out the correct XPath to grab particular elements on a page?

from lxml import html
import requests

page = requests.get('https://en.wikipedia.org/wiki/List_of_capitals_in_the_United_States')
tree = html.fromstring(page.text)

# list of states ('xpath' is a placeholder for the expression I'm missing)
state = tree.xpath('xpath')
# list of capitals (same placeholder)
capital = tree.xpath('xpath')

print 'State: ', state
print 'Capital: ', capital

Two of the XPath expressions I've tried so far:

//*[@id="mw-content-text"]/table[1]/tbody/tr[1]/td[1]/a

//*[@id="mw-content-text"]/table[1]/tbody/tr[1]/td[2]

Upvotes: 0

Views: 1719

Answers (1)

larsks

Reputation: 311328

Start with an expression that will get you the table. Here's one that works:

>>> tree.xpath('//div[@id="mw-content-text"]/table[1]')
[<Element table at 0x7f9dd7322578>]

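You want the first table in that div (hence the [1]). There does not appear to be a tbody element in the page source that requests fetches. Browsers insert tbody into the DOM they display, which is why a path copied from the inspector, like the two in the question, matches nothing here.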
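To see the tbody effect in isolation, here is a minimal sketch on a small hand-written table instead of the live Wikipedia page. It uses the standard library's xml.etree.ElementTree rather than lxml, purely so it runs without extra packages or network access. ElementTree supports only a subset of XPath and expects paths relative to the element, hence the leading dot. The snippet and its contents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the page source: note there is no <tbody>.
SNIPPET = """
<html><body>
<div id="mw-content-text">
  <table>
    <tr><th>State</th><th>Capital</th></tr>
    <tr><td><a>Alabama</a></td><td><a>Montgomery</a></td></tr>
    <tr><td><a>Alaska</a></td><td><a>Juneau</a></td></tr>
  </table>
</div>
</body></html>
"""

root = ET.fromstring(SNIPPET)

# Path as a browser inspector would report it (with tbody): matches nothing.
with_tbody = root.findall(".//div[@id='mw-content-text']/table[1]/tbody/tr")

# Path that follows the markup as served (no tbody): matches every row.
without_tbody = root.findall(".//div[@id='mw-content-text']/table[1]/tr")

print(len(with_tbody), len(without_tbody))
```

When an expression copied from the inspector returns an empty list, checking the raw source (for example by printing page.text) for elements such as tbody is a reasonable first step.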

You could iterate over the rows in that table like this (the [1:] slice skips the first row, which holds the column headers):

for row in tree.xpath('//div[@id="mw-content-text"]/table[1]/tr')[1:]:

Within that loop, the state name is:

row[0][0].text

That is the first child of the row (a <td> element), then the first child of that (an <a> element), and finally the text content of that element.

The capital is in the fourth cell:

row[3][0].text

So:

>>> for row in tree.xpath('//div[@id="mw-content-text"]/table[1]/tr')[1:]:
...   st = row[0][0].text
...   cap = row[3][0].text
...   print 'The capital of %s is %s' % (st, cap)
The capital of Alabama is Montgomery
The capital of Alaska is Juneau
The capital of Arizona is Phoenix
[...]
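For readers on Python 3, where print is a function, the same loop might look like the sketch below. It runs against a small hand-written table, not the live page, and uses the standard library's xml.etree.ElementTree in place of lxml so that it is self-contained. The four-column layout is an assumption chosen so that row[3] is the capital, as in the answer above. Indexing into an element to reach its children works the same way in both libraries.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the table: state, abbreviation, statehood year, capital.
SNIPPET = """
<div id="mw-content-text">
  <table>
    <tr><th>State</th><th>Abr.</th><th>Statehood</th><th>Capital</th></tr>
    <tr><td><a>Alabama</a></td><td>AL</td><td>1819</td><td><a>Montgomery</a></td></tr>
    <tr><td><a>Alaska</a></td><td>AK</td><td>1959</td><td><a>Juneau</a></td></tr>
  </table>
</div>
"""

root = ET.fromstring(SNIPPET)  # root is the <div> itself

lines = []
for row in root.findall("table[1]/tr")[1:]:  # [1:] skips the header row
    st = row[0][0].text    # first cell -> its <a> -> text
    cap = row[3][0].text   # fourth cell -> its <a> -> text
    lines.append('The capital of %s is %s' % (st, cap))

print('\n'.join(lines))
```

Note that row[0][0] raises IndexError if a cell has no child element, so a row whose state name is not wrapped in a link would need separate handling.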

You can get all the state names like this:

>>> tree.xpath('//div[@id="mw-content-text"]/table[1]/tr/td[1]/a/text()')
['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
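Since the question asks for both a list of states and a list of capitals, here is a hedged sketch of collecting the two lists and pairing them. It again uses a made-up table and the standard library's xml.etree.ElementTree, which has no text() step, so it selects the <a> elements and reads their .text. With lxml, the capitals expression would presumably be the states expression with td[4] in place of td[1], by analogy with row[3] above; that has not been checked against the live page.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the table: state, abbreviation, statehood year, capital.
SNIPPET = """
<div id="mw-content-text">
  <table>
    <tr><th>State</th><th>Abr.</th><th>Statehood</th><th>Capital</th></tr>
    <tr><td><a>Alabama</a></td><td>AL</td><td>1819</td><td><a>Montgomery</a></td></tr>
    <tr><td><a>Alaska</a></td><td>AK</td><td>1959</td><td><a>Juneau</a></td></tr>
  </table>
</div>
"""

root = ET.fromstring(SNIPPET)  # root is the <div> itself

# Select the link in the first and fourth cell of every row.
# The header row drops out because its cells are <th>, not <td>.
states = [a.text for a in root.findall("table[1]/tr/td[1]/a")]
capitals = [a.text for a in root.findall("table[1]/tr/td[4]/a")]

# Pair the two lists position by position.
capital_of = dict(zip(states, capitals))
print(capital_of)
```

Pairing two independently extracted lists assumes every row contributes exactly one entry to each. If a row lacked one of the links, the lists would fall out of step, which is a reason to prefer the row-by-row loop when that cannot be guaranteed.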

Upvotes: 1
