Python Scrapy text() function unable to get empty td element

Question

I'm very, VERY new to web scraping and I'm still learning as I go. Currently, I'm using Python and Scrapy to build my own web scraper but I encountered something really odd.

As an exercise, I tried to scrape this page: https://worldpopulationreview.com/countries/countries-by-national-debt

It lists the debt-to-GDP ratio for various countries. If you look at the table on that page, you'll notice that Sudan has no population number recorded.

I tried to scrape the population for each country from that page using this spider and XPath expression:

import scrapy
import pandas as pd


class GdpDebtSpider(scrapy.Spider):
    name = 'gdp_debt'
    allowed_domains = ['worldpopulationreview.com']
    start_urls = ['https://worldpopulationreview.com/countries/countries-by-national-debt/']

    def parse(self, response):

        populations = response.xpath("//tbody/tr/td[3]/text()").getall()

The problem is that the XPath expression above,

"//tbody/tr/td[3]/text()"

does not capture the empty population cell for Sudan. It skips Sudan's population entirely, I believe because that td element does not contain any text node.

Is there a way to extract elements that have no text node as an empty string ('') instead of skipping them entirely?

Thanks so much everyone!

stranac · Accepted Answer

I'm assuming you're doing something like this:

get all names
get all populations
zip() them and yield items

I see this approach a lot, but it is not correct whenever some of the information is missing.
The two lists end up with different lengths, so zip() pairs values with the wrong rows and silently drops whatever is left over.
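To make the failure concrete, here is a minimal sketch in plain Python. The country names and numbers are hypothetical placeholders standing in for what two separate getall() calls might return when one population cell is empty; they are not taken from the real page.

```python
# Hypothetical stand-ins for the results of two separate getall() calls.
# "Country C" has an empty population cell, so text() yields nothing for it
# and the populations list is one element shorter than the names list.
names = ["Country A", "Country B", "Country C", "Country D"]
populations = ["1,000", "2,000", "4,000"]  # Country C's value is missing

# zip() pairs by position and stops at the end of the shorter list.
paired = list(zip(names, populations))

for name, population in paired:
    print(name, population)
```

Here "Country C" is paired with the value that belongs to "Country D", and "Country D" disappears from the output, all without any error being raised.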

What to do instead? Write an explicit loop.

Loop over items/rows, and for each item:

get that item's name
get that item's population
yield a scrapy item

In code, that would look something like this:

for row in response.xpath('//tbody/tr'):
    yield {
        'name': row.xpath('./td[1]//text()').get(),
        'population': row.xpath('./td[3]//text()').get()
    }

Doing it this way ensures each name is associated with the population from the same row. A missing value comes back as None, and you can let your item exporter take care of handling it.
