Reputation: 941
I'm very, VERY new to web scraping and I'm still learning as I go. Currently, I'm using Python and Scrapy to build my own web scraper but I encountered something really odd.
I tried to go to scrape this webpage right here just as an exercise: https://worldpopulationreview.com/countries/countries-by-national-debt
That's basically a webpage which lists the Debt to GDP ratio for various countries in the world. Now, if you noticed, Sudan does not have any Population number recorded in the table on that web page.
I tried to scrape the population for each country from that web page using this xpath expression:
import scrapy
import pandas as pd
class GdpDebtSpider(scrapy.Spider):
name = 'gdp_debt'
allowed_domains = ['worldpopulationreview.com']
start_urls = ['https://worldpopulationreview.com/countries/countries-by-national-debt/']
def parse(self, response):
populations = response.xpath("//tbody/tr/td[3]/text()").getall()
The problem here is that it seems like with the xpath expression above which is
"//tbody/tr/td[3]/text()"
it's unable to capture the empty population table cell in Sudan, it basically skips the population of Sudan entirely because I believe the td
element does not contain any text node.
Is there any solution to this where we can extract elements without any text node as an empty string like this: ''
instead of skipping it entirely?
Thanks so much everyone!
Upvotes: 1
Views: 405
Reputation: 28236
I'm assuming you're doing something like this:
zip()
them and yield itemsI see this approach a lot, but in many cases (whenever not all the information is present), this is not correct.
This way of doing it can result in data being wrong or missing.
What to do instead? Write an explicit loop.
Loop over items/rows, and for each item:
In code, that would look something like this:
for row in response.xpath('//tbody/tr'):
yield {
'name': row.xpath('./td[1]//text()').get(),
'population': row.xpath('./td[3]//text()').get()
}
Doing it this way ensures correct names will be associated with correct population, and you can let your item exporter take care of correctly handling the missing information.
Upvotes: 1