Reputation: 101
I have written a spider to crawl a website. I am able to generate all the page URLs (pagination). I need help to crawl all these pages and then print the response.
import csv

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import Spider

url_string = "http://website.com/ct-50658/page-"

class SpiderName(Spider):
    name = "website"
    allowed_domains = ["website.com"]
    start_urls = ["http://website.com/page-2"]

    def printer(self, response):
        hxs = HtmlXPathSelector(response)
        x = hxs.select("//span/a/@title").extract()
        with open('website.csv', 'wb') as csvfile:
            spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|',
                                    quoting=csv.QUOTE_MINIMAL)
            for i in x:
                spamwriter.writerow(i)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        #sel=Selector(response)
        pages = hxs.select("//div[@id='srchpagination']/a/@href").extract()
        total_pages = int(pages[-2][-2:])
        j = 0
        url_list = []
        while j < total_pages:
            j = j + 1
            urls = url_string + str(j)
            url_list.append(urls)
        for one_url in url_list:
            request = Request(one_url, callback=self.printer)
        return request
Upvotes: 0
Views: 433
Reputation: 20748
You're recreating the 'website.csv' file for every one_url request's response. You should create it once, in __init__ for example, keep a reference to the CSV writer in an attribute of your spider, and use that attribute (something like self.spamwriter) inside printer.

Also, in the for one_url in url_list: loop you should use yield Request(one_url, callback=self.printer). As written, parse returns only the last Request.
Here's a sample spider with these modifications and some code simplifications:
import csv

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import Spider

url_string = "http://website.com/ct-50658/page-"

class SpiderName(Spider):
    name = "website"
    allowed_domains = ["website.com"]
    start_urls = ["http://website.com/page-2"]

    def __init__(self, category=None, *args, **kwargs):
        super(SpiderName, self).__init__(*args, **kwargs)
        # open the output file once and share the writer across all responses
        self.spamwriter = csv.writer(open('website.csv', 'wb'),
                                     delimiter=' ',
                                     quotechar='|',
                                     quoting=csv.QUOTE_MINIMAL)

    def printer(self, response):
        hxs = HtmlXPathSelector(response)
        for i in hxs.select("//span/a/@title").extract():
            # wrap the title in a list: writerow() expects a sequence of
            # fields, and a bare string would be split one character per column
            self.spamwriter.writerow([i])

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        pages = hxs.select("//div[@id='srchpagination']/a/@href").extract()
        total_pages = int(pages[-2][-2:])
        # yield one Request per page instead of returning only the last one
        for j in range(1, total_pages + 1):
            yield Request(url_string + str(j), callback=self.printer)
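One detail the sample above leaves implicit: the file handle opened in __init__ is never explicitly closed, so buffered rows may not be flushed if the process ends abruptly. A minimal sketch of one way to handle this, assuming your Scrapy version supports the spider closed() method (a shortcut for the spider_closed signal); self.csvfile is a name introduced here just for illustration:

    def __init__(self, category=None, *args, **kwargs):
        super(SpiderName, self).__init__(*args, **kwargs)
        # keep a handle to the file so it can be closed when the spider finishes
        self.csvfile = open('website.csv', 'wb')
        self.spamwriter = csv.writer(self.csvfile, delimiter=' ',
                                     quotechar='|', quoting=csv.QUOTE_MINIMAL)

    def closed(self, reason):
        # called by Scrapy when the spider closes; flushes and closes the file
        self.csvfile.close()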
Upvotes: 1