Reputation: 29511
I have around ten-odd sites that I wish to scrape. A couple of them are WordPress blogs and follow the same HTML structure, albeit with different classes. The others are either forums or blogs in other formats.
The information I would like to scrape is common to all of them: the post content, the timestamp, the author, the title and the comments.
My question is, do I have to create one separate spider for each domain? If not, how can I create a generic spider that lets me scrape by loading options from a configuration file or something similar?
I figured I could load the XPath expressions from a file whose location can be passed via the command line, but there seem to be some difficulties: scraping some domains requires that I use a regex, select(expression_here).re(regex), while others do not.
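For reference, here is a minimal sketch of what such a config-driven spider could look like; the selectors.json file name, its layout and the field names are all hypothetical, not an existing format:

import json
import scrapy

class ConfigSpider(scrapy.Spider):
    """Generic spider driven by a per-domain selector config (hypothetical layout)."""
    name = 'config_spider'

    def __init__(self, config='selectors.json', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # hypothetical config: {"start_urls": [...], "fields": {"title": {"xpath": "...", "re": "..."}, ...}}
        with open(config) as f:
            self.conf = json.load(f)
        self.start_urls = self.conf['start_urls']

    def parse(self, response):
        item = {}
        for field, rule in self.conf['fields'].items():
            sel = response.xpath(rule['xpath'])
            # apply a regex only when the config provides one
            item[field] = sel.re(rule['re']) if rule.get('re') else sel.getall()
        yield item

It could then be run with something like scrapy crawl config_spider -a config=selectors.json.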
Upvotes: 6
Views: 4554
Reputation: 688
You can use the start_requests method! You can prioritize each URL as well, and on top of that you can pass some metadata.
Here's some sample code that works:
"""
For allowed_domains:
Let’s say your target url is https://www.example.com/1.html,
then add 'example.com' to the list.
"""
class crawler(CrawlSpider):
name = "crawler_name"
allowed_domains, urls_to_scrape = parse_urls()
rules = [
Rule(LinkExtractor(
allow=['.*']),
callback='parse_item',
follow=True)
]
def start_requests(self):
for i,url in enumerate(self.urls_to_scrape):
yield scrapy.Request(url=url.strip(),callback=self.parse_item, priority=i+1, meta={"pass_anydata_hare":1})
def parse_item(self, response):
response = response.css('logic')
yield {'link':str(response.url),'extracted data':[],"meta_data":'data you passed' }
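The parse_urls() helper is not shown above; a hypothetical version that reads one URL per line from a urls.txt file might look like this:

from urllib.parse import urlparse

def parse_urls(path='urls.txt'):
    # hypothetical helper: read one URL per line and derive the allowed domains
    with open(path) as f:
        urls = [line.strip() for line in f if line.strip()]
    domains = [urlparse(u).netloc for u in urls]
    domains = [d[4:] if d.startswith('www.') else d for d in domains]
    return domains, urls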
I recommend reading this page in the Scrapy docs for more info:
https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.start_requests
Hope this helps :)
Upvotes: 0
Reputation: 2776
Well, I faced the same issue, so I created the spider class dynamically using type():
from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider


class GenericSpider(CrawlSpider):
    """a generic spider, uses type() to make new spider classes for each domain"""
    name = 'generic'
    allowed_domains = []
    start_urls = []

    @classmethod
    def create(cls, link):
        domain = urlparse(link).netloc.lower()
        # generate a class name such that domain www.google.com results
        # in class name GoogleComGenericSpider
        class_name = (domain if not domain.startswith('www.') else domain[4:]).title().replace('.', '') + cls.__name__
        return type(class_name, (cls,), {
            'allowed_domains': [domain],
            'start_urls': [link],
            'name': domain
        })
So say, to create a spider for 'http://www.google.com', I'll just do:
In [3]: google_spider = GenericSpider.create('http://www.google.com')
In [4]: google_spider
Out[4]: __main__.GoogleComGenericSpider
In [5]: google_spider.name
Out[5]: 'www.google.com'
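To actually run the generated spider classes in one process, one option (a sketch, assuming a hypothetical links list) is scrapy.crawler.CrawlerProcess:

from scrapy.crawler import CrawlerProcess

links = ['http://www.google.com', 'http://www.example.com']  # hypothetical input
process = CrawlerProcess()
for link in links:
    # each call registers a freshly generated spider class for that domain
    process.crawl(GenericSpider.create(link))
process.start()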
Hope this helps
Upvotes: 3
Reputation: 1540
You can use an empty allowed_domains attribute to instruct Scrapy not to filter any offsite requests. But in that case you must be careful and only return relevant requests from your spider.
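For example, a minimal sketch (the start URL and link selector are placeholders):

import scrapy

class UnfilteredSpider(scrapy.Spider):
    name = 'unfiltered'
    allowed_domains = []  # empty, so the offsite middleware filters nothing
    start_urls = ['https://example.com/']  # placeholder

    def parse(self, response):
        # since nothing is filtered for you, only follow the links you actually want
        for href in response.css('a.post-link::attr(href)').getall():
            yield response.follow(href, callback=self.parse)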
Upvotes: 1
Reputation: 21
I do sort of the same thing using the following XPath expressions:
'/html/head/title/text()' for the title, and
'//p[string-length(text()) > 150]/text()' for the post content.
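In a Scrapy callback that would look something like this (a sketch; the field names are just for illustration):

def parse(self, response):
    # page title from the <title> tag
    title = response.xpath('/html/head/title/text()').get()
    # paragraphs longer than 150 characters, a rough heuristic for post content
    content = response.xpath('//p[string-length(text()) > 150]/text()').getall()
    yield {'title': title, 'content': content}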
Upvotes: 1
Reputation: 4188
In your Scrapy spider, set allowed_domains to a list of domains, for example:
from scrapy.spiders import CrawlSpider

class YourSpider(CrawlSpider):
    allowed_domains = ['domain1.com', 'domain2.com']
Hope it helps.
Upvotes: 3
Reputation: 22515
You should use BeautifulSoup, especially if you're using Python. It enables you to find elements in the page and extract text, optionally with the help of regular expressions.
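For instance, a quick sketch (the URL and the timestamp pattern are placeholders):

import re
import requests
from bs4 import BeautifulSoup

html = requests.get('https://example.com/some-post').text  # placeholder URL
soup = BeautifulSoup(html, 'html.parser')

title = soup.find('title').get_text()
# find_all accepts a compiled regex, e.g. to match date-like strings such as 2013-05-01
timestamps = soup.find_all(string=re.compile(r'\d{4}-\d{2}-\d{2}'))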
Upvotes: 0