Reputation: 2627
Almost duplicate of scrapy allow all subdomains!
Note: First of all, I'm new to Scrapy & I don't have enough reputation to post a comment on that question. So, I decided to ask a new one!
Problem Statement:
I was using BeautifulSoup to scrape email addresses from a particular website. It works fine if the email address is available on that particular page (i.e. example.com), but not if it's only available on example.com/contact-us, pretty obvious!
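Roughly, my BeautifulSoup attempt looked like this (the URL and the email regex are just placeholders for illustration):

import re

import requests
from bs4 import BeautifulSoup

# fetch one page and pull out anything that looks like an email address
html = requests.get('http://example.com/').text
text = BeautifulSoup(html, 'html.parser').get_text()
emails = set(re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text))
print(emails)  # finds emails only on this one page, not on example.com/contact-us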
For that reason, I decided to use Scrapy. Even though I'm using allowed_domains to get only links related to the domain, it still gives me all the offsite links as well. I also tried another approach suggested by @agstudy in this question, which is to use SgmlLinkExtractor in rules.
Then I got this error:
Traceback (most recent call last):
  File "/home/msn/Documents/email_scraper/email_scraper/spiders/emails_spider.py", line 14, in <module>
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
  File "/home/msn/Documents/scrapy/lib/python3.5/site-packages/scrapy/contrib/linkextractors/sgml.py", line 7, in <module>
    from scrapy.linkextractors.sgml import *
  File "/home/msn/Documents/scrapy/lib/python3.5/site-packages/scrapy/linkextractors/sgml.py", line 7, in <module>
    from sgmllib import SGMLParser
ImportError: No module named 'sgmllib'
Basically, the ImportError is about sgmllib (the simple SGML parser module) having been removed in Python 3.x.
What I've tried so far:
class EmailsSpiderSpider(scrapy.Spider):
    name = 'emails'
    # allowed_domains = ['example.com']
    start_urls = [
        'http://example.com/'
    ]

    rules = [
        Rule(SgmlLinkExtractor(allow_domains=("example.com"),), callback='parse_url'),
    ]

    def parse_url(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select("//a/@href").extract()
        print(set(urls))  # sanity check
I also tried LxmlLinkExtractor with CrawlSpider, but I'm still getting offsite links; roughly the sketch below.
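The CrawlSpider attempt was more or less this (the domain is just a placeholder):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class EmailsSpider(CrawlSpider):
    name = 'emails'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = [
        Rule(LxmlLinkExtractor(allow_domains='example.com'), callback='parse_url', follow=True),
    ]

    def parse_url(self, response):
        # still prints every href found in the page HTML, offsite ones included
        urls = response.xpath('//a/@href').extract()
        print(set(urls))  # sanity check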
What should I do to get this done? Or is my approach to solving the problem wrong?
Any help would be appreciated!
Another note: the website to scrape emails from will be different every time, so I can't rely on site-specific HTML or CSS selectors!
Upvotes: 2
Views: 811
Reputation: 1888
You use an XPath expression in hxs.select('//a/@href'), which means "extract the href attribute values from all a tags on the page", so you get exactly all the links, including the offsite ones. What you can use instead is LinkExtractor, and it would be like this:
from scrapy.linkextractors import LinkExtractor

def parse_url(self, response):
    urls = [l.url for l in LinkExtractor(allow_domains='example.com').extract_links(response)]
    print(set(urls))  # sanity check
That is what LinkExtractor is really made for (I guess).
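For completeness, here is a minimal sketch of how that could fit into a whole spider; the domain and the email regex are just placeholders, adjust them for the actual site:

import re

import scrapy
from scrapy.linkextractors import LinkExtractor

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')  # rough pattern, illustration only

class EmailsSpider(scrapy.Spider):
    name = 'emails'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # collect anything that looks like an email on the current page
        emails = set(EMAIL_RE.findall(response.text))
        if emails:
            yield {'url': response.url, 'emails': sorted(emails)}
        # follow only links that stay on the allowed domain
        for link in LinkExtractor(allow_domains='example.com').extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)

Scrapy deduplicates requests by default, so reaching the same URL from different pages isn't a problem.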
By the way, keep in mind that most Scrapy examples you can find on the Internet (including Stack Overflow) refer to earlier versions that aren't fully compatible with Python 3.
Upvotes: 1