user1287245

Reputation: 95

Scrapy/Python - How to deal with missing data?

I am writing a Scrapy program that captures social network profile URLs from pages (e.g. Facebook, Twitter, etc.).

Some of the pages that I scrape don't have those links on them, so the program needs to be able to deal with that.

I have this line of code that finds a Twitter profile link when the link is on the page, but it fails when the link is missing:

item['twitterprofileurl'] = startupdetails.xpath("//a[contains(@href,'https://twitter.com') and not(contains(@href,'https://twitter.com/500startups'))]/@href").extract()[0]

How can I change it so that the code doesn't fail if the link isn't there?

Full code:

import scrapy
from scrapy import Spider
from scrapy.selector import Selector
import datetime
from saas.items import StartupItemTest


class StartupSpider(Spider):
    name = "500cotest"
    allowed_domains = ["500.co"]
    start_urls = [
        "http://500.co/startup/chouxbox/"
    ]

    def parse(self, response):
        startup = Selector(response).xpath('//div[contains(@id, "startup_detail")]')

        for startupdetails in startup:
            item = StartupItemTest()
            item['logo'] = startupdetails.xpath('//img[@class="logo"]/@src').extract()[0]
            item['startupurl'] = startupdetails.xpath('//a[@class="outline"]/@href').extract()[0]
            item['source'] = '500.co'
            item['datetime'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            item['description'] = startupdetails.xpath("//p[@class='description']/text()").extract()[0]

            item['twitterprofileurl'] = startupdetails.xpath("//a[contains(@href,'https://twitter.com') and not(contains(@href,'https://twitter.com/500startups'))]/@href").extract()[0]
            yield item

Upvotes: 0

Views: 863

Answers (1)

Valdir Stumm Junior

Reputation: 4667

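Use the .extract_first() method instead of .extract()[0]. When the XPath matches nothing, .extract() returns an empty list, so indexing it with [0] raises an IndexError. .extract_first() returns None in that case instead.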

So, instead of:

item['twitterprofileurl'] = startupdetails.xpath("<your xpath>").extract()[0]

You'd have:

item['twitterprofileurl'] = startupdetails.xpath("<your xpath>").extract_first()
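To make the difference concrete, here is a minimal sketch in plain Python of the fallback behaviour that .extract_first() provides. The helper name first_or_default and the sample URL are hypothetical and used only for illustration; the lists stand in for what .extract() would return. In Scrapy itself you would not need the helper: .extract_first() also accepts a default argument, e.g. .extract_first(default=''), if you would rather store an empty string than None.

```python
def first_or_default(values, default=None):
    """Return the first element of values, or default if values is empty.

    Hypothetical helper that mirrors what .extract_first() does with the
    list of strings that .extract() would return.
    """
    return values[0] if values else default


# Stand-ins for the result of .extract(); the URL is made up for illustration.
link_present = ["https://twitter.com/example_startup"]
link_missing = []

item = {}

# Page with a Twitter link: the first match is stored.
item["twitterprofileurl"] = first_or_default(link_present)
print(item["twitterprofileurl"])

# Page without a Twitter link: no IndexError, the field is simply None.
item["twitterprofileurl"] = first_or_default(link_missing)
print(item["twitterprofileurl"])

# By contrast, link_missing[0] would raise IndexError, which is the
# failure described in the question.
```

The same change applies to the logo, startupurl, and description fields in the spider, since each of them also indexes .extract() with [0] and would fail the same way on a page missing that element.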

Upvotes: 2
