I have a list of URLs I want to scrape and run through all the same pipelines. How do I begin this? I'm not actually sure where to even start.
The main idea is that my crawl works through a site and its pages. It then yields to parse each page and update a database. What I am now trying to achieve is to parse the pages of all the existing URLs in the database that were not crawled that day.
I have tried doing this in a pipeline using the close_spider method, but can't get these URLs to be requested/parsed. As soon as I yield, the whole close_spider method is no longer called.
```python
def close_spider(self, spider):
    for item in models.Items.select().where(models.Items.last_scraped_ts < '2016-02-06 10:00:00'):
        print item.url
        yield Request(item.url, callback=spider.parse_product, dont_filter=True)
```
Upvotes: 2
Views: 838
Reputation: 169
You could simply copy and paste the URLs into start_urls; if you haven't overridden start_requests, parse will be the default callback. If it is a long list and you don't want ugly code, you can instead override start_requests, open your file or do a DB call, and for each item yield a request for that URL with a callback to parse. This lets you use your parse function and your pipelines, and lets Scrapy handle concurrency. If you just have a list without that extra infrastructure already existing and the list isn't too long, Sulot's answer is easier.
Upvotes: 0
Reputation: 514
(Re-reading your thread, I am not sure I am answering your question at all...) I have done something similar without Scrapy, using the lxml and requests modules. The URLs:
```python
listeofurl = ['url1', 'url2']
```
Or, if the URLs follow a pattern, generate them:
```python
for i in range(0, 10):
    url = urlpattern + str(i)
```
Then I made a loop that parses each URL, all of which share the same page structure:

```python
import json

import requests
from lxml import html

listeOfurl = ['url1', 'url2']
data = {}

for url in listeOfurl:
    page = requests.get(url)
    tree = html.fromstring(page.content)
    dataYouWantToKeep = tree.xpath('//*[@id="content"]/div/h2/text()[2]')
    data[url] = dataYouWantToKeep

# and at the end you save all the data as JSON
with open(filejson, 'w') as outfile:
    json.dump(data, outfile)
```
Upvotes: 1