Reputation: 83

How to get rid of duplicate links while crawling a website using python scrapy?

I have the following code which crawls the given website address but the problem is that it duplicates the URL while crawling. I need unique and complete list of URL which can be reached from home page of the website.

Please help me in doing that.

############################################################################################
import scrapy
urlset = set()
class MySpider(scrapy.Spider):

    name = "MySpider"

    def __init__(self, allowed_domains=None, start_urls=None):
        super().__init__()

        if allowed_domains is None:
            self.allowed_domains = []
        else:
            self.allowed_domains = allowed_domains
        if start_urls is None:
            self.start_urls = []
        else:
            self.start_urls = start_urls  

    def parse(self, response):
        print('[parse] url:', response.url)
        # extract all links from page
        all_links = response.xpath('*//a/@href').extract()
        all_links = set(all_links)
        all_links = list(all_links)
        # iterate over links
        for link in all_links:
            if("https:" in link or "http:" in link):
                    if(link not in urlset):
                        print('[+] link:', link)

                        full_link = response.urljoin(link)
                        urlset.add(full_link)
                        print("----------Full Link: "+full_link)
                        request = response.follow(full_link, callback=self.parse)
                        yield request
                        yield {'url': response.url}                        



    # def print_this_link(self, response):
    #     print('[print_this_link] url:', response.url)
    #     title = response.xpath('//title/text()').get() # get() will replace extract() in the future
    #     # text = response.xpath('//body/text()').get()
    #     yield {'url': response.url, 'title': title}


# --- run without creating project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file as CSV, JSON or XML
    'FEED_FORMAT': 'csv',     # csv, json, xml
    'FEED_URI': 'file://C:/Tmp1/output.csv', # 
})
c.crawl(MySpider)
c.crawl(MySpider, allowed_domains=["copperpodip.com"], start_urls=["https://www.copperpodip.com"])
c.start()

Just run this code as it is. output of the above code

Output of running the code:

C:\Users\Carthaginian\Desktop\projectLink\crawler\crawler\spiders>python stacklink.py
2019-08-22 14:40:17 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: scrapybot)
2019-08-22 14:40:17 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.7.0, Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 22:20:52) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a  20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17134-SP0
2019-08-22 14:40:17 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'file://C:/Tmp1/output.csv', 'USER_AGENT': 'Mozilla/5.0'}
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet Password: 2feebff3115b2d5b
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Spider opened
2019-08-22 14:40:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-08-22 14:40:17 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'file://C:/Tmp1/output.csv', 'USER_AGENT': 'Mozilla/5.0'}
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet Password: b27fd364782f9b57
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Spider opened
2019-08-22 14:40:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Closing spider (finished)
2019-08-22 14:40:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.025426,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 8, 22, 9, 10, 17, 695429),
 'log_count/INFO': 19,
 'start_time': datetime.datetime(2019, 8, 22, 9, 10, 17, 670003)}
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Spider closed (finished)
2019-08-22 14:40:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.copperpodip.com> (referer: None)
[parse] url: https://www.copperpodip.com
[+] link: https://www.copperpodip.com/due-diligence
----------Full Link: https://www.copperpodip.com/due-diligence
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/22/Patent-Alert-PayPal-Patent-Can-Protect-PCs-From-Ransomware-Attacks
----------Full Link: https://www.copperpodip.com/single-post/2019/04/22/Patent-Alert-PayPal-Patent-Can-Protect-PCs-From-Ransomware-Attacks
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/leadership
----------Full Link: https://www.copperpodip.com/leadership
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses
----------Full Link: https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/05/20/Patent-Alert-Teslas-New-Patent-Application-Proposes-a-Sunroof-with-Electric-Tinting
----------Full Link: https://www.copperpodip.com/single-post/2019/05/20/Patent-Alert-Teslas-New-Patent-Application-Proposes-a-Sunroof-with-Electric-Tinting
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/prior-art-search
----------Full Link: https://www.copperpodip.com/prior-art-search
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/08/08/Patent-Alert-Amazon-wins-patent-for-spoilage-sniffing-refrigerator
----------Full Link: https://www.copperpodip.com/single-post/2019/08/08/Patent-Alert-Amazon-wins-patent-for-spoilage-sniffing-refrigerator
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/25/Patent-Alert-IBM-Awarded-DLT-Patent-for-Data-Sharing-and-Validation
----------Full Link: https://www.copperpodip.com/single-post/2019/04/25/Patent-Alert-IBM-Awarded-DLT-Patent-for-Data-Sharing-and-Validation
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/patent-monetization
----------Full Link: https://www.copperpodip.com/patent-monetization
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/12/The-Future-is-Green-Energy---HyperSolars-Environment-Friendly-Hydrogen-Generator
----------Full Link: https://www.copperpodip.com/single-post/2019/04/12/The-Future-is-Green-Energy---HyperSolars-Environment-Friendly-Hydrogen-Generator
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/privacy-policy
----------Full Link: https://www.copperpodip.com/privacy-policy
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/ip-news
----------Full Link: https://www.copperpodip.com/ip-news
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com
----------Full Link: https://www.copperpodip.com
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/contact-us
----------Full Link: https://www.copperpodip.com/contact-us
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/blog
----------Full Link: https://www.copperpodip.com/blog
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.linkedin.com/company/copperpod-ip
----------Full Link: https://www.linkedin.com/company/copperpod-ip
2019-08-22 14:40:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.linkedin.com': <GET https://www.linkedin.com/company/copperpod-ip>
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/source-code-review
----------Full Link: https://www.copperpodip.com/source-code-review
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/request-for-samples
----------Full Link: https://www.copperpodip.com/request-for-samples
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/01/07/Making-Amends-Chinas-New-Intellectual-Property-Appeals-Court
----------Full Link: https://www.copperpodip.com/single-post/2019/01/07/Making-Amends-Chinas-New-Intellectual-Property-Appeals-Court
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/case-study-due-diligence
----------Full Link: https://www.copperpodip.com/case-study-due-diligence
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.siliconindiamagazine.com/magazine/patent-and-trademark-law-special-july-2018/#page=28
----------Full Link: https://www.siliconindiamagazine.com/magazine/patent-and-trademark-law-special-july-2018/#page=28
2019-08-22 14:40:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.siliconindiamagazine.com': <GET https://www.siliconindiamagazine.com/magazine/patent-and-trademark-law-special-july-2018/#page=28>
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/reverse-engineering
----------Full Link: https://www.copperpodip.com/reverse-engineering
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/careers
----------Full Link: https://www.copperpodip.com/careers
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/infringement-claim-charts
----------Full Link: https://www.copperpodip.com/infringement-claim-charts
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/case-study-infringement-analysis
----------Full Link: https://www.copperpodip.com/case-study-infringement-analysis
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/30/Tokenization-Future-of-Payment-Security
----------Full Link: https://www.copperpodip.com/single-post/2019/04/30/Tokenization-Future-of-Payment-Security
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/case-study-source-code-review
----------Full Link: https://www.copperpodip.com/case-study-source-code-review
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/08/21/Patent-Alert-WINDGO-granted-IoT-wearable-products-patent-having-sensing-and-response-components
----------Full Link: https://www.copperpodip.com/single-post/2019/08/21/Patent-Alert-WINDGO-granted-IoT-wearable-products-patent-having-sensing-and-response-components
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
2019-08-22 14:40:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.copperpodip.com/due-diligence> (referer: https://www.copperpodip.com)
[parse] url: https://www.copperpodip.com/due-diligence
[+] link: https://www.facebook.com/copperpodip/
----------Full Link: https://www.facebook.com/copperpodip/
2019-08-22 14:40:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/copperpodip/>
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/due-diligence>
{'url': 'https://www.copperpodip.com/due-diligence'}
2019-08-22 14:40:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> (referer: https://www.copperpodip.com)
[parse] url: https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses
[+] link: https://www.copperpodip.com/blog/date/2017-03
----------Full Link: https://www.copperpodip.com/blog/date/2017-03
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/emergingtech
----------Full Link: https://www.copperpodip.com/blog/tag/emergingtech
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/date/2018-09
----------Full Link: https://www.copperpodip.com/blog/date/2018-09
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/date/2018-02
----------Full Link: https://www.copperpodip.com/blog/date/2018-02
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/itc
----------Full Link: https://www.copperpodip.com/blog/tag/itc
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/intel
----------Full Link: https://www.copperpodip.com/blog/tag/intel
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/iot
----------Full Link: https://www.copperpodip.com/blog/tag/iot
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/nokia
----------Full Link: https://www.copperpodip.com/blog/tag/nokia
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/fintech
----------Full Link: https://www.copperpodip.com/blog/tag/fintech
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/patents
----------Full Link: https://www.copperpodip.com/blog/tag/patents
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/uber
----------Full Link: https://www.copperpodip.com/blog/tag/uber
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/home%20automation
----------Full Link: https://www.copperpodip.com/blog/tag/home%20automation
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/duediligence
----------Full Link: https://www.copperpodip.com/blog/tag/duediligence
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/fake%20news
----------Full Link: https://www.copperpodip.com/blog/tag/fake%20news
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/paypal
----------Full Link: https://www.copperpodip.com/blog/tag/paypal
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/virtualreality
----------Full Link: https://www.copperpodip.com/blog/tag/virtualreality
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/author/Arjunvir-Singh
----------Full Link: https://www.copperpodip.com/blog/author/Arjunvir-Singh
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/trademarks
----------Full Link: https://www.copperpodip.com/blog/tag/trademarks
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/qualcomm
----------Full Link: https://www.copperpodip.com/blog/tag/qualcomm
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/Apple
----------Full Link: https://www.copperpodip.com/blog/tag/Apple
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/5g
----------Full Link: https://www.copperpodip.com/blog/tag/5g
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/code%20review
----------Full Link: https://www.copperpodip.com/blog/tag/code%20review
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/licensing
----------Full Link: https://www.copperpodip.com/blog/tag/licensing
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/internet%20of%20things
----------Full Link: https://www.copperpodip.com/blog/tag/internet%20of%20things
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/date/2018-03
----------Full Link: https://www.copperpodip.com/blog/date/2018-03
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/technology
----------Full Link: https://www.copperpodip.com/blog/tag/technology

Upvotes: 2

Answers (3)

gangabass

Reputation: 10666

This will work because Scrapy will deal with duplicate URLs for you:

def parse(self, response):
    yield {'url': response.url}     
    print('[parse] url:', response.url)
    # extract all links from page
    all_links = response.xpath('*//a/@href').extract()
    # iterate over links
    for link in all_links:
        if("https:" in link or "http:" in link):
            print('[+] link:', link)
            full_link = response.urljoin(link)
            print("----------Full Link: "+full_link)
            request = response.follow(full_link, callback=self.parse)
            yield request

Upvotes: 1

tomjn

Reputation: 5390

scrapy should automatically avoid revisiting previously visited urls (using the dupefilter class). It's not entirely clear to me what you want to do here but I think you want to crawl the website and find all links? In that case you should move your second yield (yield {'url': response.url}) to earlier in your parse function.

I think the following gives you what you want:

import scrapy

class MySpider(scrapy.Spider):

    name = "copperpodip"
    start_urls = ["https://copperpodip.com"]
    allowed_domains = ["copperpodip.com"]

    def parse(self, response):
        yield {'url': response.url}
        for link in response.xpath('*//a/@href').getall():
            yield response.follow(link, self.parse)

if I run this as:

scrapy runspider scrapy_test.py -o test.json

then the resulting json file does not contain any duplicate links.

Upvotes: 2

Oka

Reputation: 101

I don't know scrapy at all, but can't you use a list (or a set, that's easier)and check if there's already a record of the same link ?

link_list = list
if link not in link_list :
   link_list.append(link)

Edit : you seem to use a set already, that you change for a list just after :

all_links = set(all_links)
all_links = list(all_links)

Upvotes: 2

How to get rid of duplicate links while crawling a website using python scrapy?

Answers (3)

Related Questions