Christopher Ell

Reputation: 2048

Select Next node in Python with XPath

I am trying to scrape population information from Wikipedia country pages. The trouble is that the node holding the number contains nothing that refers to population; the label appears only in the node before it. So I am trying to write an XPath expression that moves to the next node, but I can't find the correct syntax.

For example for the following page:

https://en.wikipedia.org/wiki/Afghanistan

Below is an XPath expression that gets me to the node before the population number I want to scrape:

//table[@class='infobox geography vcard']//tr[@class = 'mergedtoprow']//a[contains(@href,"Demographics")]/../..

It searches the table for an href that contains "Demographics", then goes up two levels to the parent's parent. The problem is that the label is in a different node from the number I want to extract, so I need something that moves to the next node.

I have seen the expression /following-sibling::div[1], but it doesn't seem to work with my expression and I don't know why.

If anyone can think of a more direct way of finding the node in the above web page that would be good too.

Thanks

Edit: Below is the Python code I am using

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from urllib.parse import urljoin



class CountryinfoSpider(scrapy.Spider):
    name = 'CountryInfo'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/List_of_sovereign_states_in_the_2020s']

    def parse(self, response):
        ## Extract all countries names
        countries = response.xpath('//table//b//@title').extract()

        for country in countries:
            url = response.xpath('//table//a[@title="'+ country +'"]/@href').extract_first()
            capital = response.xpath('//table//a[@title="'+ country +'"]/../..//i/a/@title').extract()


            absolute_url = urljoin('https://en.wikipedia.org/', url)

            yield Request(absolute_url, callback = self.parse_country)

    def parse_country(self, response):

        test = response.xpath("//table[@class='infobox geography vcard']//tr[@class='mergedtoprow']//a[contains(@href, 'Demographics')]/../..").extract()

        yield{'Test':test}

It's a little more complicated than I explained: I go to the page "List of sovereign states in the 2020s" and copy the country names, capitals and URLs. Then I follow each URL, after joining it to the Wikipedia base URL, and try to use the XPath expression I am working on to pull the population.

Thanks

Upvotes: 1

Views: 163

Answers (1)

Tomalak

Reputation: 338406

I think the general answer to your question is: "predicates can be nested".

//table[
  @class='infobox geography vcard'
]//tr[
  @class = 'mergedtoprow' and .//a[contains(@href, "Demographics")]
]/following-sibling::tr[1]/td/text()[1]
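
To see the nested predicate in action outside of Scrapy, here is a minimal sketch using lxml. The HTML below is a simplified, made-up stand-in for the infobox (the real Wikipedia markup is more involved and may differ), and the population value is a placeholder. It also hints at why /following-sibling::div[1] probably returned nothing: assuming the link sits in a cell inside the row, going up two levels from the link lands on a tr, and the siblings of a tr are other tr elements, not div elements.

```python
from lxml import html

# Simplified, hypothetical infobox markup; NOT the real Wikipedia HTML.
SAMPLE = """
<html><body>
<table class="infobox geography vcard">
  <tr class="mergedtoprow">
    <th><a href="/wiki/Demographics_of_Afghanistan">Population</a></th>
  </tr>
  <tr class="mergedrow">
    <th>Estimate</th>
    <td>12,345,678<sup>[1]</sup></td>
  </tr>
  <tr class="mergedtoprow">
    <th><a href="/wiki/Some_other_page">Other</a></th>
  </tr>
</table>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Nested predicate: pick the row that contains the "Demographics" link,
# then step to the row right after it and take the first text node of its cell.
POPULATION_XPATH = (
    "//table[@class='infobox geography vcard']"
    "//tr[@class='mergedtoprow' and .//a[contains(@href, 'Demographics')]]"
    "/following-sibling::tr[1]/td/text()[1]"
)
population = [text.strip() for text in tree.xpath(POPULATION_XPATH)]

# The attempt from the question: two levels up from the link is a <tr>,
# which has no <div> siblings in this sample, so nothing matches.
DIV_XPATH = (
    "//table[@class='infobox geography vcard']"
    "//tr[@class='mergedtoprow']//a[contains(@href, 'Demographics')]"
    "/../../following-sibling::div[1]"
)
div_attempt = tree.xpath(DIV_XPATH)

print(population)
print(div_attempt)
```

In the spider, the same expression string could be passed to response.xpath(...) in parse_country, followed by .extract_first() to get a single value instead of a list.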

Upvotes: 1
