Reputation: 83
I have the following code which crawls the given website address but the problem is that it duplicates the URL while crawling. I need unique and complete list of URL which can be reached from home page of the website.
Please help me in doing that.
############################################################################################
import scrapy
urlset = set()
class MySpider(scrapy.Spider):
name = "MySpider"
def __init__(self, allowed_domains=None, start_urls=None):
super().__init__()
if allowed_domains is None:
self.allowed_domains = []
else:
self.allowed_domains = allowed_domains
if start_urls is None:
self.start_urls = []
else:
self.start_urls = start_urls
def parse(self, response):
print('[parse] url:', response.url)
# extract all links from page
all_links = response.xpath('*//a/@href').extract()
all_links = set(all_links)
all_links = list(all_links)
# iterate over links
for link in all_links:
if("https:" in link or "http:" in link):
if(link not in urlset):
print('[+] link:', link)
full_link = response.urljoin(link)
urlset.add(full_link)
print("----------Full Link: "+full_link)
request = response.follow(full_link, callback=self.parse)
yield request
yield {'url': response.url}
# def print_this_link(self, response):
# print('[print_this_link] url:', response.url)
# title = response.xpath('//title/text()').get() # get() will replace extract() in the future
# # text = response.xpath('//body/text()').get()
# yield {'url': response.url, 'title': title}
# --- run without creating project and save in `output.csv` ---
from scrapy.crawler import CrawlerProcess
c = CrawlerProcess({
'USER_AGENT': 'Mozilla/5.0',
# save in file as CSV, JSON or XML
'FEED_FORMAT': 'csv', # csv, json, xml
'FEED_URI': 'file://C:/Tmp1/output.csv', #
})
c.crawl(MySpider)
c.crawl(MySpider, allowed_domains=["copperpodip.com"], start_urls=["https://www.copperpodip.com"])
c.start()
Just run this code as it is. output of the above code
Output of running the code:
C:\Users\Carthaginian\Desktop\projectLink\crawler\crawler\spiders>python stacklink.py
2019-08-22 14:40:17 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: scrapybot)
2019-08-22 14:40:17 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.7.0, Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 22:20:52) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17134-SP0
2019-08-22 14:40:17 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'file://C:/Tmp1/output.csv', 'USER_AGENT': 'Mozilla/5.0'}
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet Password: 2feebff3115b2d5b
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Spider opened
2019-08-22 14:40:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-08-22 14:40:17 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'file://C:/Tmp1/output.csv', 'USER_AGENT': 'Mozilla/5.0'}
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet Password: b27fd364782f9b57
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-08-22 14:40:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Spider opened
2019-08-22 14:40:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-22 14:40:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Closing spider (finished)
2019-08-22 14:40:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.025426,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 8, 22, 9, 10, 17, 695429),
'log_count/INFO': 19,
'start_time': datetime.datetime(2019, 8, 22, 9, 10, 17, 670003)}
2019-08-22 14:40:17 [scrapy.core.engine] INFO: Spider closed (finished)
2019-08-22 14:40:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.copperpodip.com> (referer: None)
[parse] url: https://www.copperpodip.com
[+] link: https://www.copperpodip.com/due-diligence
----------Full Link: https://www.copperpodip.com/due-diligence
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/22/Patent-Alert-PayPal-Patent-Can-Protect-PCs-From-Ransomware-Attacks
----------Full Link: https://www.copperpodip.com/single-post/2019/04/22/Patent-Alert-PayPal-Patent-Can-Protect-PCs-From-Ransomware-Attacks
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/leadership
----------Full Link: https://www.copperpodip.com/leadership
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses
----------Full Link: https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/05/20/Patent-Alert-Teslas-New-Patent-Application-Proposes-a-Sunroof-with-Electric-Tinting
----------Full Link: https://www.copperpodip.com/single-post/2019/05/20/Patent-Alert-Teslas-New-Patent-Application-Proposes-a-Sunroof-with-Electric-Tinting
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/prior-art-search
----------Full Link: https://www.copperpodip.com/prior-art-search
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/08/08/Patent-Alert-Amazon-wins-patent-for-spoilage-sniffing-refrigerator
----------Full Link: https://www.copperpodip.com/single-post/2019/08/08/Patent-Alert-Amazon-wins-patent-for-spoilage-sniffing-refrigerator
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/25/Patent-Alert-IBM-Awarded-DLT-Patent-for-Data-Sharing-and-Validation
----------Full Link: https://www.copperpodip.com/single-post/2019/04/25/Patent-Alert-IBM-Awarded-DLT-Patent-for-Data-Sharing-and-Validation
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/patent-monetization
----------Full Link: https://www.copperpodip.com/patent-monetization
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/12/The-Future-is-Green-Energy---HyperSolars-Environment-Friendly-Hydrogen-Generator
----------Full Link: https://www.copperpodip.com/single-post/2019/04/12/The-Future-is-Green-Energy---HyperSolars-Environment-Friendly-Hydrogen-Generator
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/privacy-policy
----------Full Link: https://www.copperpodip.com/privacy-policy
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/ip-news
----------Full Link: https://www.copperpodip.com/ip-news
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com
----------Full Link: https://www.copperpodip.com
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/contact-us
----------Full Link: https://www.copperpodip.com/contact-us
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/blog
----------Full Link: https://www.copperpodip.com/blog
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.linkedin.com/company/copperpod-ip
----------Full Link: https://www.linkedin.com/company/copperpod-ip
2019-08-22 14:40:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.linkedin.com': <GET https://www.linkedin.com/company/copperpod-ip>
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/source-code-review
----------Full Link: https://www.copperpodip.com/source-code-review
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/request-for-samples
----------Full Link: https://www.copperpodip.com/request-for-samples
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/01/07/Making-Amends-Chinas-New-Intellectual-Property-Appeals-Court
----------Full Link: https://www.copperpodip.com/single-post/2019/01/07/Making-Amends-Chinas-New-Intellectual-Property-Appeals-Court
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/case-study-due-diligence
----------Full Link: https://www.copperpodip.com/case-study-due-diligence
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.siliconindiamagazine.com/magazine/patent-and-trademark-law-special-july-2018/#page=28
----------Full Link: https://www.siliconindiamagazine.com/magazine/patent-and-trademark-law-special-july-2018/#page=28
2019-08-22 14:40:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.siliconindiamagazine.com': <GET https://www.siliconindiamagazine.com/magazine/patent-and-trademark-law-special-july-2018/#page=28>
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/reverse-engineering
----------Full Link: https://www.copperpodip.com/reverse-engineering
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/careers
----------Full Link: https://www.copperpodip.com/careers
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/infringement-claim-charts
----------Full Link: https://www.copperpodip.com/infringement-claim-charts
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/case-study-infringement-analysis
----------Full Link: https://www.copperpodip.com/case-study-infringement-analysis
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/04/30/Tokenization-Future-of-Payment-Security
----------Full Link: https://www.copperpodip.com/single-post/2019/04/30/Tokenization-Future-of-Payment-Security
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/case-study-source-code-review
----------Full Link: https://www.copperpodip.com/case-study-source-code-review
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
[+] link: https://www.copperpodip.com/single-post/2019/08/21/Patent-Alert-WINDGO-granted-IoT-wearable-products-patent-having-sensing-and-response-components
----------Full Link: https://www.copperpodip.com/single-post/2019/08/21/Patent-Alert-WINDGO-granted-IoT-wearable-products-patent-having-sensing-and-response-components
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com>
{'url': 'https://www.copperpodip.com'}
2019-08-22 14:40:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.copperpodip.com/due-diligence> (referer: https://www.copperpodip.com)
[parse] url: https://www.copperpodip.com/due-diligence
[+] link: https://www.facebook.com/copperpodip/
----------Full Link: https://www.facebook.com/copperpodip/
2019-08-22 14:40:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/copperpodip/>
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/due-diligence>
{'url': 'https://www.copperpodip.com/due-diligence'}
2019-08-22 14:40:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses> (referer: https://www.copperpodip.com)
[parse] url: https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses
[+] link: https://www.copperpodip.com/blog/date/2017-03
----------Full Link: https://www.copperpodip.com/blog/date/2017-03
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/emergingtech
----------Full Link: https://www.copperpodip.com/blog/tag/emergingtech
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/date/2018-09
----------Full Link: https://www.copperpodip.com/blog/date/2018-09
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/date/2018-02
----------Full Link: https://www.copperpodip.com/blog/date/2018-02
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/itc
----------Full Link: https://www.copperpodip.com/blog/tag/itc
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/intel
----------Full Link: https://www.copperpodip.com/blog/tag/intel
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/iot
----------Full Link: https://www.copperpodip.com/blog/tag/iot
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/nokia
----------Full Link: https://www.copperpodip.com/blog/tag/nokia
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/fintech
----------Full Link: https://www.copperpodip.com/blog/tag/fintech
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/patents
----------Full Link: https://www.copperpodip.com/blog/tag/patents
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/uber
----------Full Link: https://www.copperpodip.com/blog/tag/uber
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/home%20automation
----------Full Link: https://www.copperpodip.com/blog/tag/home%20automation
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/duediligence
----------Full Link: https://www.copperpodip.com/blog/tag/duediligence
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/fake%20news
----------Full Link: https://www.copperpodip.com/blog/tag/fake%20news
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/paypal
----------Full Link: https://www.copperpodip.com/blog/tag/paypal
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/virtualreality
----------Full Link: https://www.copperpodip.com/blog/tag/virtualreality
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/author/Arjunvir-Singh
----------Full Link: https://www.copperpodip.com/blog/author/Arjunvir-Singh
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/trademarks
----------Full Link: https://www.copperpodip.com/blog/tag/trademarks
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/qualcomm
----------Full Link: https://www.copperpodip.com/blog/tag/qualcomm
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/Apple
----------Full Link: https://www.copperpodip.com/blog/tag/Apple
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/5g
----------Full Link: https://www.copperpodip.com/blog/tag/5g
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/code%20review
----------Full Link: https://www.copperpodip.com/blog/tag/code%20review
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/licensing
----------Full Link: https://www.copperpodip.com/blog/tag/licensing
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/internet%20of%20things
----------Full Link: https://www.copperpodip.com/blog/tag/internet%20of%20things
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/date/2018-03
----------Full Link: https://www.copperpodip.com/blog/date/2018-03
2019-08-22 14:40:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses>
{'url': 'https://www.copperpodip.com/single-post/2019/04/10/Patent-Alert-Sonys-Prescription-VR-Glasses'}
[+] link: https://www.copperpodip.com/blog/tag/technology
----------Full Link: https://www.copperpodip.com/blog/tag/technology
Upvotes: 2
Views: 1299
Reputation: 10666
This will work because Scrapy will deal with duplicate URLs for you:
def parse(self, response):
yield {'url': response.url}
print('[parse] url:', response.url)
# extract all links from page
all_links = response.xpath('*//a/@href').extract()
# iterate over links
for link in all_links:
if("https:" in link or "http:" in link):
print('[+] link:', link)
full_link = response.urljoin(link)
print("----------Full Link: "+full_link)
request = response.follow(full_link, callback=self.parse)
yield request
Upvotes: 1
Reputation: 5390
scrapy should automatically avoid revisiting previously visited urls (using the dupefilter class). It's not entirely clear to me what you want to do here but I think you want to crawl the website and find all links? In that case you should move your second yield (yield {'url': response.url}
) to earlier in your parse function.
I think the following gives you what you want:
import scrapy
class MySpider(scrapy.Spider):
name = "copperpodip"
start_urls = ["https://copperpodip.com"]
allowed_domains = ["copperpodip.com"]
def parse(self, response):
yield {'url': response.url}
for link in response.xpath('*//a/@href').getall():
yield response.follow(link, self.parse)
if I run this as:
scrapy runspider scrapy_test.py -o test.json
then the resulting json file does not contain any duplicate links.
Upvotes: 2
Reputation: 101
I don't know scrapy at all, but can't you use a list (or a set, that's easier)and check if there's already a record of the same link ?
link_list = list
if link not in link_list :
link_list.append(link)
Edit : you seem to use a set already, that you change for a list just after :
all_links = set(all_links)
all_links = list(all_links)
Upvotes: 2