Reputation: 2764
Trying to send a 'list' of URLs for Scrapy to crawl with a certain spider by passing one long string, then splitting the string inside the crawler. I've tried copying the format given in this answer.
The list I'm trying to send to the crawler is future_urls
>>> print future_urls
set(['https://ca.finance.yahoo.com/q/hp?s=ALXN&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'http://finance.yahoo.com/q/hp?s=TFW.L&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'https://ca.finance.yahoo.com/q/hp?s=DLTR&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'https://ca.finance.yahoo.com/q/hp?s=AGNC&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'https://ca.finance.yahoo.com/q/hp?s=HMSY&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'http://finance.yahoo.com/q/hp?s=BATS.L&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m'])
Then sending it to the crawler through:
command4 = ("scrapy crawl future -o future_portfolios_{0} -t csv -a future_urls={1}").format(input_file, str(','.join(list(future_urls))))
>>> print command4
scrapy crawl future -o future_portfolios_input_10062008_10062012_ver_1.csv -t csv -a future_urls=https://ca.finance.yahoo.com/q/hp?s=ALXN&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,http://finance.yahoo.com/q/hp?s=TFW.L&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,https://ca.finance.yahoo.com/q/hp?s=DLTR&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,https://ca.finance.yahoo.com/q/hp?s=AGNC&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,https://ca.finance.yahoo.com/q/hp?s=HMSY&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,http://finance.yahoo.com/q/hp?s=BATS.L&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m
>>> type(command4)
<type 'str'>
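As a quick sanity check, plain Python joining and splitting on commas round-trips the full set (none of the URLs contain a comma), independent of Scrapy and the shell:
>>> len(','.join(list(future_urls)).split(','))
6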
My crawler (partial):
class FutureSpider(scrapy.Spider):
    name = "future"
    allowed_domains = ["finance.yahoo.com", "ca.finance.yahoo.com"]
    start_urls = ['https://ca.finance.yahoo.com/q/hp?s=%5EIXIC']

    def __init__(self, *args, **kwargs):
        super(FutureSpider, self).__init__(*args, **kwargs)
        self.future_urls = kwargs.get('future_urls').split(',')
        self.rate_returns_len_min = 12
        self.required_amount_of_returns = 12

        for x in self.future_urls:
            print "Going to scrape:"
            print x

    def parse(self, response):
        if self.future_urls:
            for x in self.future_urls:
                yield scrapy.Request(x, self.stocks1)
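For context, Scrapy forwards each -a name=value option to the spider's __init__ as a keyword argument, which is what kwargs.get('future_urls') picks up. A rough way to exercise that path directly, bypassing the command line (assuming the spider class is importable where future_urls is defined):
spider = FutureSpider(future_urls=','.join(list(future_urls)))  # mimics -a future_urls=...
print len(spider.future_urls)  # 6 when the string is passed in directly like this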
However, what is printed out by the print "Going to scrape:" and print x statements is:
Going to scrape:
https://ca.finance.yahoo.com/q/hp?s=ALXN
Only one URL, and only a portion of the first URL in future_urls, which is obviously problematic.
I can't seem to figure out why the crawler won't scrape all of the URLs in future_urls.
...
Upvotes: 1
Views: 196
Reputation: 356
I think it's stopping when it hits the ampersand (&): the shell treats an unquoted & as a command terminator, so everything after it never reaches Scrapy. You can escape it by using urllib.quote.
For example:
import urllib
escapedurl = urllib.quote('https://ca.finance.yahoo.com/q/hp?s=ALXN&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m')
Then to get it back to normal you can do:
>>> urllib.unquote(escapedurl)
'https://ca.finance.yahoo.com/q/hp?s=ALXN&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m'
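Putting that together with the command in the question, one possible approach (a sketch using the question's future_urls, input_file, and command4 names, not tested end-to-end) is to quote each URL before joining, so the shell never sees a bare &, and then unquote inside the spider:
import urllib

# percent-encode '&', '?', '=' etc. in each URL before the string hits the shell
quoted_urls = ','.join(urllib.quote(u, safe='') for u in future_urls)
command4 = ("scrapy crawl future -o future_portfolios_{0} -t csv "
            "-a future_urls={1}").format(input_file, quoted_urls)
and in FutureSpider.__init__, reverse it:
self.future_urls = [urllib.unquote(u) for u in kwargs.get('future_urls').split(',')]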
Upvotes: 1