Arjun Thakur

Reputation: 655

Scrapy check if scraped url returning any downloadable file or not

I am new to Scrapy and haven't found any help so far.

I want to make a small scraper that scrapes all the URLs on a page, hits them one by one, and, if a URL returns a downloadable file of any extension, downloads it and saves it to a specified location. Here's the code that I have written:

items.py

import scrapy

class ZcrawlerItem(scrapy.Item):
    file = scrapy.Field()
    file_url = scrapy.Field()

spider.py

from scrapy import Selector
from scrapy.spiders import CrawlSpider
from scrapy.http import Request

from crawler.items import ZcrawlerItem

DOMAIN = 'example.com'
URL = 'http://%s' % DOMAIN


class MycrawlerSpider(CrawlSpider):
    name = "mycrawler"
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse_dir_contents(self, response):
        # Print the response headers to inspect what the server returned
        print(response.headers)
        item = ZcrawlerItem()
        item['file_url'] = response.url
        return item

    def parse(self, response):
        hxs = Selector(response)
        # Follow every absolute link on the page
        for url in hxs.xpath('//a/@href').extract():
            if url.startswith('http://') or url.startswith('https://'):
                yield Request(url, callback=self.parse_dir_contents)
        # Also follow iframe sources
        for url in hxs.xpath('//iframe/@src').extract():
            yield Request(url, callback=self.parse_dir_contents)

The issue I am facing is that parse_dir_contents is not showing the headers, so it's difficult to check whether the response is a downloadable file or just regular page content.
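For example, this is roughly the kind of check I was hoping to do in the callback (just a sketch; Content-Type and Content-Disposition are standard HTTP headers, but not every server sends Content-Disposition for downloads):

    def parse_dir_contents(self, response):
        # Sketch only: decide from the headers whether this response is a file
        content_type = response.headers.get('Content-Type', b'').decode('utf-8', 'ignore')
        disposition = response.headers.get('Content-Disposition', b'').decode('utf-8', 'ignore')
        if 'attachment' in disposition or not content_type.startswith('text/html'):
            item = ZcrawlerItem()
            item['file_url'] = response.url
            return item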

BTW, I am using Scrapy 1.1.0 and Python 3.4.

Any help would be really appreciated!!

Upvotes: 1

Views: 1632

Answers (1)

Arjun Thakur

Reputation: 655

So after some R&D I found a solution, and here's the updated spider.py:

from urllib.parse import urljoin

from scrapy import Selector
from scrapy.spiders import CrawlSpider
from scrapy.http import Request

from crawler.items import ZcrawlerItem

DOMAIN = 'example.com'
URL = 'http://%s' % DOMAIN


class MycrawlerSpider(CrawlSpider):
    name = "mycrawler"
    # No domain restriction, so file links hosted on other domains are followed too
    allowed_domains = []
    allowed_mime_type = [b'application/zip', b'application/x-msdownload', b'application/pdf',
                         b'image/jpeg', b'image/jpg', b'image/png',
                         b'application/octet-stream']
    start_urls = [
        URL
    ]

    def parse(self, response):
        hxs = Selector(response)
        for url in hxs.xpath('//a/@href').extract():
            if url.startswith('http://') or url.startswith('https://'):
                yield Request(url, callback=self.parse_item)
            elif 'javascript' not in url:
                # Relative link: resolve it against the current page URL
                new_url = urljoin(response.url, url.strip())
                print("New url : ", new_url)
                yield Request(new_url, callback=self.parse_item)
        for url in hxs.xpath('//iframe/@src').extract():
            yield Request(url, callback=self.parse_item)

    def parse_item(self, response):
        # If the Content-Type matches one of the allowed MIME types, treat the
        # response as a downloadable file and emit an item for the pipeline
        if response.headers['Content-Type'] in self.allowed_mime_type:
            item = ZcrawlerItem()
            item['file_urls'] = response.url
            item['referer'] = response.request.headers['Referer'].decode("utf-8")
            yield item
        else:
            self.logger.info('No allowed file type found, trying the next page: %s', response.url)
            # Not a file: re-crawl this URL as a normal HTML page to follow its links
            yield Request(response.url, callback=self.parse, dont_filter=True)
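Note that the item now uses file_urls and referer (and the pipeline below adds path), so items.py has to be updated to match. Something like this is what the spider and pipeline expect; the field names are simply the ones used in the code above:

import scrapy

class ZcrawlerItem(scrapy.Item):
    file_urls = scrapy.Field()
    referer = scrapy.Field()
    path = scrapy.Field()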

The output is then passed to pipeline.py, where I save the info in PostgreSQL and download the files:

import datetime
import hashlib
import os

import psycopg2

try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

FILES_STORE = '<location to save files>'


class Pipeline(object):
    def __init__(self):
        self.conn = psycopg2.connect(user="postgres", password="pass",
                                     dbname="db_name",
                                     host='localhost')

    def process_item(self, item, spider):
        # Download the file first, then record its URL, referer and local path
        item['path'] = self.write_to_file(item['file_urls'])
        cur = self.conn.cursor()
        cur.execute('''
                insert into scrape (file_url, referer, path, created_date)
                values (%s, %s, %s, %s);
                ''', [
            item['file_urls'],
            item['referer'],
            item['path'],
            datetime.datetime.now()])
        self.conn.commit()
        return item

    def write_to_file(self, url):
        # Save the file into a directory named after the MD5 hash of its URL
        response = urllib2.urlopen(url)
        directory = FILES_STORE + str(hashlib.md5(url.encode('utf-8')).hexdigest()) + "/"
        if not os.path.exists(directory):
            os.makedirs(directory)
        file_name = url.split('/')[-1]
        with open(directory + str(file_name), "wb") as handle:
            handle.write(response.read())
        return directory + str(file_name)
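For the pipeline to actually run, it also has to be enabled in settings.py via ITEM_PIPELINES; the dotted path below assumes the class above lives in crawler/pipelines.py, so adjust it to wherever you put it:

ITEM_PIPELINES = {
    'crawler.pipelines.Pipeline': 300,
}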

Hope this will help someone, cheers (y)

Upvotes: 1
