George

Reputation: 115

Passing arguments to allowed_domains in Scrapy

I am creating a crawler that takes a user input and crawls for all the links on the site. However, I need to limit the crawling and extraction of links to that domain only, with no outside domains. The crawler itself is where I need it to be; my issue is that I cannot seem to pass the option supplied through the scrapy command into allowed_domains. Below is the first script to run:

# First Script
import os

def userInput():
    user_input = raw_input("Please enter URL. Please do not include http://: ")
    os.system("scrapy runspider -a user_input='http://" + user_input + "' crawler_prod.py")

userInput()
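
For context, scrapy runspider -a name=value hands each -a pair to the spider's constructor as a keyword argument, and the stock Spider.__init__ stores it as an instance attribute (self.user_input here). A minimal, illustrative sketch of that mechanism (the spider name is made up):

# Illustrative sketch only: how a -a argument surfaces inside a spider.
from scrapy import Request, Spider

class EchoSpider(Spider):
    name = "echo"

    def start_requests(self):
        # -a user_input=http://example.com makes Scrapy call
        # EchoSpider(user_input='http://example.com'); the default
        # Spider.__init__ then stores the keyword as self.user_input.
        yield Request(self.user_input, callback=self.parse)

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)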

The script it runs is the crawler, which crawls the given domain. Here is the crawler code:

#Crawler
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import Request
from scrapy.http import Request

class InputSpider(CrawlSpider):
        name = "Input"
        #allowed_domains = ["example.com"]

        def allowed_domains(self):
            self.allowed_domains = user_input

        def start_requests(self):
            yield Request(url=self.user_input)

        rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')
        ]

        def parse_item(self, response):
            x = HtmlXPathSelector(response)
            filename = "output.txt"
            open(filename, 'ab').write(response.url + "\n")

I have tried yielding the request sent through by the terminal command, but that crashes the crawler, and the way I have it now crashes it as well. I have also tried just putting in allowed_domains = [user_input], and it reports that user_input is not defined. I have been playing with Scrapy's Request class to get this to work, with no luck. Is there a better way to restrict crawling to the given domain?

Edit:

Here is my new code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spiders import BaseSpider
from scrapy import Request
from scrapy.http import Request
from scrapy.utils.httpobj import urlparse
#from run_first import *

class InputSpider(CrawlSpider):
        name = "Input"
        #allowed_domains = ["example.com"]

        #def allowed_domains(self):
            #self.allowed_domains = user_input

        #def start_requests(self):
            #yield Request(url=self.user_input)

        def __init__(self, *args, **kwargs):
            inputs = kwargs.get('urls', '').split(',') or []
            self.allowed_domains = [urlparse(d).netloc for d in inputs]
            # self.start_urls = [urlparse(c).netloc for c in inputs] # For start_urls

        rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')
        ]

        def parse_item(self, response):
            x = HtmlXPathSelector(response)
            filename = "output.txt"
            open(filename, 'ab').write(response.url + "\n")

This is the output log for the new code:

2017-04-18 18:18:01 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2017-04-18 18:18:01 [scrapy] INFO: Optional features available: ssl, http11, boto
2017-04-18 18:18:01 [scrapy] INFO: Overridden settings: {'LOG_FILE': 'output.log'}
2017-04-18 18:18:43 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2017-04-18 18:18:43 [scrapy] INFO: Optional features available: ssl, http11, boto
2017-04-18 18:18:43 [scrapy] INFO: Overridden settings: {'LOG_FILE': 'output.log'}
2017-04-18 18:18:43 [py.warnings] WARNING: /home/****-you/Python_Projects/Network-Multitool/crawler/crawler_prod.py:1: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
  from scrapy.contrib.spiders import CrawlSpider, Rule

2017-04-18 18:18:43 [py.warnings] WARNING: /home/****-you/Python_Projects/Network-Multitool/crawler/crawler_prod.py:2: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead
  from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

2017-04-18 18:18:43 [py.warnings] WARNING: /home/****-you/Python_Projects/Network-Multitool/crawler/crawler_prod.py:2: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors.sgml` is deprecated, use `scrapy.linkextractors.sgml` instead
  from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

2017-04-18 18:18:43 [py.warnings] WARNING: /home/****-you/Python_Projects/Network-Multitool/crawler/crawler_prod.py:27: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
  Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')

2017-04-18 18:18:43 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2017-04-18 18:18:43 [boto] DEBUG: Retrieving credentials from metadata server.
2017-04-18 18:18:44 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2017-04-18 18:18:44 [boto] ERROR: Unable to read instance data, giving up
2017-04-18 18:18:44 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-04-18 18:18:44 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-04-18 18:18:44 [scrapy] INFO: Enabled item pipelines: 
2017-04-18 18:18:44 [scrapy] INFO: Spider opened
2017-04-18 18:18:44 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-18 18:18:44 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-18 18:18:44 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: 
2017-04-18 18:18:44 [scrapy] INFO: Closing spider (finished)
2017-04-18 18:18:44 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 18, 22, 18, 44, 794155),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 3,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2017, 4, 18, 22, 18, 44, 790331)}
2017-04-18 18:18:44 [scrapy] INFO: Spider closed (finished)

Edit:

I was able to figure out the answer to my issue by looking into the answers and rereading the docs. Below is what I added to the crawler script to get it to work.

def __init__(self, url=None, *args, **kwargs):
    super(InputSpider, self).__init__(*args, **kwargs)
    self.allowed_domains = [url]
    self.start_urls = ["http://" + url]
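
Note that with this signature the argument name is url and the spider prepends the scheme itself, so the launching command has to match: the first script would pass the bare domain as url instead of user_input. A hedged example of an invocation consistent with the snippet above (the domain is only a placeholder):

scrapy runspider -a url=example.com crawler_prod.py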

Upvotes: 4

Views: 4857

Answers (1)

Granitosaurus

Reputation: 21406

There are a few things you are missing here.

  1. The initial requests, which come from start_urls, are not being filtered.
  2. You cannot override allowed_domains once the run starts.

To deal with these issues, you need to write your own offsite middleware, or at least modify the existing one with the changes you need.

What happens is that the OffsiteMiddleware, which handles allowed_domains, compiles the allowed_domains value into a regular expression once the spider opens, and that parameter is never read again.

Add something like this to your middlewares.py:

from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.utils.httpobj import urlparse_cached
class MyOffsiteMiddleware(OffsiteMiddleware):

    def should_follow(self, request, spider):
        """Return bool whether to follow a request"""
        # hostname can be None for wrong urls (like javascript links)
        host = urlparse_cached(request).hostname or ''
        if host in spider.allowed_domains:
            return True
        return False

Activate it in settings.py:

SPIDER_MIDDLEWARES = {
    # enable our middleware
    'myspider.middlewares.MyOffsiteMiddleware': 500,
    # disable the old middleware
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}

Now your spider should follow anything that you have in allowed_domains, even if you modify it mid-run.
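
As a hypothetical illustration of that last point (the domains are made up, and the example assumes MyOffsiteMiddleware above is enabled), a callback can widen the allow-list while the crawl is running and later requests will honour it:

# Hypothetical sketch: domains added at runtime are respected because
# should_follow() re-checks spider.allowed_domains for every request,
# instead of relying on a regex compiled when the spider opens.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class GrowingSpider(CrawlSpider):
    name = "growing"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com"]
    rules = [Rule(LinkExtractor(), follow=True, callback="parse_item")]

    def parse_item(self, response):
        # Widen the allow-list mid-run; only works with the custom middleware.
        if "partner.example.com" not in self.allowed_domains:
            self.allowed_domains.append("partner.example.com")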

Edit: for your case:

from scrapy.utils.httpobj import urlparse
class MySpider(Spider):
    def __init__(self, *args, **kwargs):
        input = kwargs.get('urls', '').split(',') or []
        self.allowed_domains = [urlparse(d).netloc for d in input]

And now you can run:

scrapy crawl myspider -a "urls=foo.com,bar.com"
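
One caveat worth flagging (an observation on top of the answer, not part of it): Python's urlparse only fills in netloc when the value carries a scheme, so a bare foo.com comes back as an empty string and allowed_domains ends up containing '' entries. Passing full URLs avoids that:

# urlparse needs a scheme before it reports a network location (Python 2,
# matching the environment in the question's logs).
from urlparse import urlparse

print(urlparse("foo.com").netloc)         # '' -> useless for allowed_domains
print(urlparse("http://foo.com").netloc)  # 'foo.com'

So the safer invocation is scrapy crawl myspider -a "urls=http://foo.com,http://bar.com".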

Upvotes: 6
