chrr

Reputation: 320

Scrapy approach to scraping multiple URLs

I have a project which requires a great deal of data scraping to be done.

I've been looking at Scrapy which so far I am very impressed with but I am looking for the best approach to do the following:

1) I want to scrape multiple URLs, passing the same variable into each one. For example, let's assume I want to return the top result for the keyword "python" from Bing, Google and Yahoo.

I would want to scrape http://www.google.co.uk/q=python, http://www.yahoo.com?q=python and http://www.bing.com/?q=python (not the actual URLs, but you get the idea).

I can't find a way to specify dynamic URLs built from the keyword; the only option I can think of is to generate a file in PHP (or similar) that builds the URLs, and then tell Scrapy to crawl the links in that file.

2) Obviously each search engine has its own markup, so I would need to tell the results apart in order to use the corresponding XPath to extract the relevant data.

3) Lastly, I would like to write the results of the scraped Item to a database (probably Redis), but only once all 3 URLs have finished scraping. Essentially, I want to build up a "profile" from the 3 search engines and save the combined result in one transaction.

If anyone has any thoughts on any of these points I would be very grateful.

Thank you

Upvotes: 1

Views: 2909

Answers (3)

tsing

Reputation: 1571

You can use the -a switch to pass a key-value pair to the spider, which can carry a particular search word:

scrapy crawl <spider_name> -a search_word=python
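A minimal sketch of the spider side, using current Scrapy syntax (scrapy.Spider is the newer name for BaseSpider) and hypothetical spider name and search URLs; the -a argument arrives as a keyword argument to the spider's constructor:

import scrapy

class SearchSpider(scrapy.Spider):
    name = "search"  # hypothetical spider name

    def __init__(self, search_word=None, *args, **kwargs):
        super(SearchSpider, self).__init__(*args, **kwargs)
        # -a search_word=python from the command line lands here
        self.start_urls = [
            "http://www.google.co.uk/search?q=%s" % search_word,
            "http://search.yahoo.com/search?p=%s" % search_word,
            "http://www.bing.com/search?q=%s" % search_word,
        ]

    def parse(self, response):
        pass  # extraction logic goes here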

Upvotes: 0

Peter Kirby

Reputation: 1985

1) BaseSpider has an __init__ method that can be overridden in subclasses. This is where the start_urls and allowed_domains variables are set. If you have a list of URLs in mind before running the spider, then you can insert them dynamically here.

For example, in a few of the spiders I have built, I pull preformatted groups of URLs from MongoDB and insert them into the start_urls list in one bulk insert.
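A rough sketch of that pattern, assuming a local MongoDB instance and hypothetical database/collection names ("scraping" / "urls") holding documents shaped like {'url': ...}:

import pymongo
import scrapy

class ProfileSpider(scrapy.Spider):
    name = "profile"  # hypothetical spider name

    def __init__(self, *args, **kwargs):
        super(ProfileSpider, self).__init__(*args, **kwargs)
        # pull the preformatted URLs from MongoDB in one go
        collection = pymongo.MongoClient()["scraping"]["urls"]
        self.start_urls = [doc["url"] for doc in collection.find()]

    def parse(self, response):
        pass  # extraction logic goes here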

2) This might be a little trickier, but you can easily see the crawled URL by looking at the response object (response.url). You should be able to check whether the URL contains 'google', 'bing', or 'yahoo', and then use the pre-specified selectors for a URL of that type.
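A sketch of that dispatch inside parse(), using current Scrapy selector syntax; the XPath expressions are placeholders, since each engine's real markup would need inspecting:

def parse(self, response):
    # pick the selector for the top result based on which engine answered
    if 'google' in response.url:
        xpath = '//h3[1]//text()'                        # placeholder XPath
    elif 'bing' in response.url:
        xpath = '//li[@class="b_algo"][1]//h2//text()'   # placeholder XPath
    else:  # yahoo
        xpath = '//div[@id="web"]//h3[1]//text()'        # placeholder XPath
    yield {
        'engine': response.url,
        'title': response.xpath(xpath).extract_first(),
    }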

3) I am not so sure that #3 is possible, or at least not without some difficulty. As far as I know, the URLs in the start_urls list are not crawled in order, and each item arrives in the pipeline independently. Without some serious core hacking, I am not sure you will be able to collect a group of response objects and pass them into a pipeline together.

However, you might consider serializing the data to disk temporarily, and then bulk-saving it to your database later on. One of the crawlers I built receives groups of around 10000 URLs. Rather than making 10000 single-item database insertions, I store the URLs (and collected data) in BSON and then insert them into MongoDB later.
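One way to sketch that deferred write, assuming MongoDB and a hypothetical item pipeline that buffers everything while the spider runs and performs a single bulk insert when it closes:

import pymongo

class BulkInsertPipeline(object):
    # buffer items during the crawl, write them all at once at the end

    def open_spider(self, spider):
        self.items = []
        self.collection = pymongo.MongoClient()["scraping"]["results"]  # hypothetical names

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        if self.items:
            # one bulk insert instead of thousands of single-item insertions
            self.collection.insert_many(self.items)

The pipeline would still need to be enabled via ITEM_PIPELINES in the project settings.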

Upvotes: 3

Oren Mazor

Reputation: 4487

I would use mechanize for this.

import mechanize
br = mechanize.Browser()
# pretend to be a regular browser and skip robots.txt so the search page loads
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.set_handle_robots(False)
response = br.open('https://www.google.ca/search?q=python')
# collect every link on the results page
links = list(br.links())

which gives you all of the links. Or you can filter them by class:

links = [aLink for aLink in br.links() if ('class', 'l') in aLink.attrs]

Upvotes: 1
