Mohsin Aljiwala

Reputation: 2627

Scrapy is not filtering results as per allowed_domains

Almost duplicate of scrapy allow all subdomains!

Note: First of all I'm new to Scrapy & I don't have enough reputation to put a comment on this question. So, I decided to ask a new one!

Problem Statement:

I was using BeautifulSoup to scrape email addresses from a particular website. It works fine if the email address is available on that particular page (i.e. example.com), but not if it's on example.com/contact-us, which is pretty obvious!

For that reason, I decided to use Scrapy. Though I'm using allowed_domains to get only domain-related links, it still gives me all the offsite links too. I also tried another approach suggested by @agstudy in this question, using SgmlLinkExtractor in rules.

Then I got this error,

Traceback (most recent call last):     
    File "/home/msn/Documents/email_scraper/email_scraper/spiders/emails_spider.py", line 14, in <module>
        from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor  
    File "/home/msn/Documents/scrapy/lib/python3.5/site-packages/scrapy/contrib/linkextractors/sgml.py", line 7, in <module>  
      from scrapy.linkextractors.sgml import *  
    File "/home/msn/Documents/scrapy/lib/python3.5/site-packages/scrapy/linkextractors/sgml.py", line 7, in <module>  
      from sgmllib import SGMLParser  
ImportError: No module named 'sgmllib'

Basically, the ImportError is about the removal of sgmllib (the simple SGML parser) in Python 3.x
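If I understand the newer Scrapy layout correctly, the SGML-based extractor is simply gone on Python 3 and the lxml-based one is exposed as LinkExtractor, so presumably the import should be:

# LinkExtractor is the lxml-based replacement for the old SGML-based extractor
from scrapy.linkextractors import LinkExtractor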

What I've tried so far:

class EmailsSpiderSpider(scrapy.Spider):
    name = 'emails'
    # allowed_domains = ['example.com']
    start_urls = [
        'http://example.com/'
    ]

    rules = [
        Rule(SgmlLinkExtractor(allow_domains=("example.com"),), callback='parse_url'),
    ]

    def parse_url(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select("//a/@href").extract()
        print(set(urls))  # sanity check

I also tried LxmlLinkExtractor with CrawlSpider, but I'm still getting offsite links.
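For reference, the CrawlSpider attempt looked roughly like this (a sketch, again with example.com standing in for the real domain):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class EmailsCrawlSpider(CrawlSpider):
    name = 'emails_crawl'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = [
        # only follow links on the allowed domain, send every page to parse_url
        Rule(LxmlLinkExtractor(allow_domains=('example.com',)),
             callback='parse_url', follow=True),
    ]

    def parse_url(self, response):
        # same sanity check as in the spider above
        urls = response.xpath('//a/@href').extract()
        print(set(urls))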

What should I do to get this done? Or is my approach to solving the problem wrong?

Any help would be appreciated!

Another Note: The website to scrape emails from will be different every time. So, I can't rely on specific HTML or CSS selectors!

Upvotes: 2

Views: 811

Answers (1)

mizhgun

Reputation: 1888

You use an XPath expression in hxs.select('//a/@href'), which means "extract the href attribute values from all a tags on the page", so you get exactly that: all the links, including offsite ones. What you can use instead is LinkExtractor, and it would look like this:

from scrapy.linkextractors import LinkExtractor

def parse_url(self, response):
    urls = [l.url for l in LinkExtractor(allow_domains='example.com').extract_links(response)]
    print(set(urls))  # sanity check

That is what LinkExtractor is really made for (I guess).
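Wired into the spider from your question, a minimal sketch (example.com still being a placeholder, and the email extraction itself left out since it is site-specific) might look like this:

import scrapy
from scrapy.linkextractors import LinkExtractor


class EmailsSpiderSpider(scrapy.Spider):
    name = 'emails'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # keep only links that point to the allowed domain
        links = LinkExtractor(allow_domains='example.com').extract_links(response)
        urls = {link.url for link in links}
        print(urls)  # sanity check: no offsite links should show up here
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_url)

    def parse_url(self, response):
        pass  # placeholder: scrape the emails from response.text here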

By the way, keep in mind that most Scrapy examples you can find on the Internet (including Stack Overflow) refer to earlier versions that are not fully compatible with Python 3.

Upvotes: 1
