user4591756

Scrapy - parse a url without crawling

I have a list of URLs I want to scrape, sending each one through the same pipelines. How do I begin? I'm not actually sure where to start.

The main idea is that my crawler works its way through a site and its pages, yielding each page to be parsed and used to update a database. What I am now trying to achieve is to also parse the pages of all the existing URLs in the database that were not crawled that day.

I have tried doing this in a pipeline using the close_spider method, but I can't get these URLs to be requested/parsed. As soon as I add the yield, the whole close_spider method is no longer called:

def close_spider(self, spider):
    # Try to re-queue every URL not scraped since the cutoff; the yield turns
    # this method into a generator, so Scrapy never runs its body at all.
    for item in models.Items.select().where(models.Items.last_scraped_ts < '2016-02-06 10:00:00'):
        print item.url
        yield Request(item.url, callback=spider.parse_product, dont_filter=True)
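
A minimal sketch of one alternative pattern, since a yield from close_spider is never consumed: connect a handler to the spider_idle signal inside the existing spider and push the extra requests back into the engine. models.Items, last_scraped_ts and parse_product come from the code above; the class and spider name, the requeued flag and the engine.crawl(request, spider) call are assumptions, the latter matching the Scrapy releases current at the time of this question.

from scrapy import Request, Spider, signals
from scrapy.exceptions import DontCloseSpider

import models  # the question's database models (assumed importable)


class ProductSpider(Spider):
    name = 'products'  # stand-in for the existing spider

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(ProductSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Fire the handler whenever the spider runs out of pending requests.
        crawler.signals.connect(spider.queue_stale_urls, signal=signals.spider_idle)
        spider.requeued = False
        return spider

    def queue_stale_urls(self, spider):
        if self.requeued:
            return  # already re-queued once; let the spider close normally
        self.requeued = True
        stale = models.Items.select().where(
            models.Items.last_scraped_ts < '2016-02-06 10:00:00')
        for item in stale:
            # Hand each request straight to the engine; this signature matches
            # the Scrapy versions of that era.
            self.crawler.engine.crawl(
                Request(item.url, callback=self.parse_product, dont_filter=True),
                spider)
        # Keep the spider alive so the freshly queued requests are processed.
        raise DontCloseSpider

    def parse_product(self, response):
        # ... the existing parsing logic that updates the database ...
        pass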

Upvotes: 2

Views: 838

Answers (2)

Will Madaus

Reputation: 169

You could simply copy and paste the URLs into start_urls; if you don't override start_requests, parse will be the default callback. If it is a long list and you don't want ugly code, you can override start_requests, open your file or make a DB call, and for each item yield a request for that URL with a callback to parse. This lets you use your parse function and your pipelines, and lets Scrapy handle the concurrency. If you just have a list without that extra infrastructure already in place and the list isn't too long, Sulot's answer is easier.
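
A minimal sketch of the start_requests approach described above, assuming the URLs live in the models.Items table from the question and that parse_product is the existing callback; the spider name and the date cutoff are illustrative.

import scrapy

import models  # the question's database models (assumed importable)


class RescrapeSpider(scrapy.Spider):
    name = 'rescrape'  # illustrative name

    def start_requests(self):
        # Pull every URL that has not been scraped recently and feed it into
        # the normal parse/pipeline flow; Scrapy handles the concurrency.
        stale = models.Items.select().where(
            models.Items.last_scraped_ts < '2016-02-06 10:00:00')
        for item in stale:
            yield scrapy.Request(item.url, callback=self.parse_product,
                                 dont_filter=True)

    def parse_product(self, response):
        # ... the existing parsing logic that yields items to the pipelines ...
        pass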

Upvotes: 0

Sulot

Reputation: 514

(Re-reading your thread, I am not sure I am answering your question at all...) I have done something similar without Scrapy, using the lxml and requests modules instead. The URLs:

listeofurl=['url1','url2']

or, if the URLs follow a pattern, generate them:

listeofurl = []
for i in range(0, 10):
    listeofurl.append(urlpattern + str(i))

Then I made a loop that parses each URL, all of which share the same pattern:

import json

import requests
from lxml import html

listeOfurl = ['url1', 'url2']
data = {}

for url in listeOfurl:
    page = requests.get(url)
    tree = html.fromstring(page.content)
    DataYouWantToKeep = tree.xpath('//*[@id="content"]/div/h2/text()[2]')
    data[url] = DataYouWantToKeep

# and at the end you save all the data as JSON
filejson = 'output.json'  # any output path will do
with open(filejson, 'w') as outfile:
    json.dump(data, outfile)

Upvotes: 1
