Reputation: 2764
Trying to send a 'list' of URLs for Scrapy to crawl with a certain spider by passing one long string, then splitting the string inside the crawler. I've tried copying the format given in this answer.
The list I'm trying to send to the crawler is future_urls
>>> print future_urls
set(['https://ca.finance.yahoo.com/q/hp?s=ALXN&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'http://finance.yahoo.com/q/hp?s=TFW.L&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'https://ca.finance.yahoo.com/q/hp?s=DLTR&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'https://ca.finance.yahoo.com/q/hp?s=AGNC&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'https://ca.finance.yahoo.com/q/hp?s=HMSY&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'http://finance.yahoo.com/q/hp?s=BATS.L&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m'])
Then sending it to the crawler through:
command4 = ("scrapy crawl future -o future_portfolios_{0} -t csv -a future_urls={1}").format(input_file, str(','.join(list(future_urls))))
>>> print command4
scrapy crawl future -o future_portfolios_input_10062008_10062012_ver_1.csv -t csv -a future_urls=https://ca.finance.yahoo.com/q/hp?s=ALXN&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,http://finance.yahoo.com/q/hp?s=TFW.L&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,https://ca.finance.yahoo.com/q/hp?s=DLTR&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,https://ca.finance.yahoo.com/q/hp?s=AGNC&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,https://ca.finance.yahoo.com/q/hp?s=HMSY&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,http://finance.yahoo.com/q/hp?s=BATS.L&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m
>>> type(command4)
<type 'str'>
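As a quick sanity check, plain Python joining and splitting on commas round-trips the full set (none of the URLs contain a comma), independent of Scrapy and the shell:
>>> len(','.join(list(future_urls)).split(','))
6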
My crawler (partial):
class FutureSpider(scrapy.Spider):
    name = "future"
    allowed_domains = ["finance.yahoo.com", "ca.finance.yahoo.com"]
    start_urls = ['https://ca.finance.yahoo.com/q/hp?s=%5EIXIC']

    def __init__(self, *args, **kwargs):
        super(FutureSpider, self).__init__(*args, **kwargs)
        self.future_urls = kwargs.get('future_urls').split(',')
        self.rate_returns_len_min = 12
        self.required_amount_of_returns = 12

        for x in self.future_urls:
            print "Going to scrape:"
            print x

    def parse(self, response):
        if self.future_urls:
            for x in self.future_urls:
                yield scrapy.Request(x, self.stocks1)
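For context, Scrapy forwards each -a name=value option to the spider's __init__ as a keyword argument, which is what kwargs.get('future_urls') picks up. A rough way to exercise that path directly, bypassing the command line (assuming the spider class is importable where future_urls is defined):
spider = FutureSpider(future_urls=','.join(list(future_urls)))  # mimics -a future_urls=...
print len(spider.future_urls)  # 6 when the string is passed in directly like this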
However, what is printed out by the print "Going to scrape:" and print x statements is:
Going to scrape:
https://ca.finance.yahoo.com/q/hp?s=ALXN
Only one URL, and only a portion of the first URL in future_urls, which is obviously problematic.
I can't seem to figure out why the crawler won't scrape all of the URLs in future_urls.
...
Upvotes: 1
Views: 196
Reputation: 356
I think it's stopping when it hits the ampersand (&): the shell treats an unquoted & as a command terminator, so everything after it never reaches Scrapy. You can escape it by using urllib.quote.
For example:
import urllib
escapedurl = urllib.quote('https://ca.finance.yahoo.com/q/hp?s=ALXN&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m')
Then to get it back to normal you can do:
>>> urllib.unquote(escapedurl)
'https://ca.finance.yahoo.com/q/hp?s=ALXN&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m'
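Putting that together with the command in the question, one possible approach (a sketch using the question's future_urls, input_file, and command4 names, not tested end-to-end) is to quote each URL before joining, so the shell never sees a bare &, and then unquote inside the spider:
import urllib

# percent-encode '&', '?', '=' etc. in each URL before the string hits the shell
quoted_urls = ','.join(urllib.quote(u, safe='') for u in future_urls)
command4 = ("scrapy crawl future -o future_portfolios_{0} -t csv "
            "-a future_urls={1}").format(input_file, quoted_urls)
and in FutureSpider.__init__, reverse it:
self.future_urls = [urllib.unquote(u) for u in kwargs.get('future_urls').split(',')]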
Upvotes: 1