Reputation: 95
I am writing a Scrapy program that captures social network profile URLs from pages (e.g. Facebook, Twitter, etc.).
Some of the pages that I scrape don't have those links on them, so the program needs to be able to deal with that.
I have this line of code that finds a Twitter profile link when the link is on the page, but fails when it is not:
item['twitterprofileurl'] = startupdetails.xpath("//a[contains(@href,'https://twitter.com') and not(contains(@href,'https://twitter.com/500startups'))]/@href").extract()[0]
How can I change it so that the code doesn't fail if the link isn't there?
Full code:
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
import datetime

from saas.items import StartupItemTest


class StartupSpider(Spider):
    name = "500cotest"
    allowed_domains = ["500.co"]
    start_urls = [
        "http://500.co/startup/chouxbox/"
    ]

    def parse(self, response):
        startup = Selector(response).xpath('//div[contains(@id, "startup_detail")]')
        for startupdetails in startup:
            item = StartupItemTest()
            item['logo'] = startupdetails.xpath('//img[@class="logo"]/@src').extract()[0]
            item['startupurl'] = startupdetails.xpath('//a[@class="outline"]/@href').extract()[0]
            item['source'] = '500.co'
            item['datetime'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            item['description'] = startupdetails.xpath("//p[@class='description']/text()").extract()[0]
            item['twitterprofileurl'] = startupdetails.xpath("//a[contains(@href,'https://twitter.com') and not(contains(@href,'https://twitter.com/500startups'))]/@href").extract()[0]
            yield item
Upvotes: 0
Views: 863
Reputation: 4667
Use the .extract_first() method instead of .extract()[0]. It returns None when there's nothing to extract.
So, instead of:
item['twitterprofileurl'] = startupdetails.xpath("<your xpath>").extract()[0]
You'd have:
item['twitterprofileurl'] = startupdetails.xpath("<your xpath>").extract_first()
Upvotes: 2