Reputation: 499
I found some answers on how to extract all available links from a website, and all of them were about the scrapy module. I also copied one of the code examples:
from scrapy import Spider
from scrapy.linkextractors import LinkExtractor

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://webpage.com']

    def parse(self, response):
        le = LinkExtractor()
        for link in le.extract_links(response):
            print(link)
But I need to launch it and get a simple python list of all html pages, so I can get some information from them using urllib2 and bs4.
How do I launch this class correctly to get this list?
Upvotes: 0
Views: 157
Reputation: 1801
scrapy is a great tool for scraping websites, but it is more than just the snippet you posted. What you posted is a spider definition. If it is embedded in a scrapy project, you can run this spider, e.g. in your terminal, with scrapy crawl myspider.
Then your spider will visit http://webpage.com and extract all the links on that page. Each url will be printed out, but that's all.
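If all you really need is a plain python list of urls to feed into urllib2 and bs4, you can also drive the spider from a plain script instead of printing. This is only a minimal sketch of that idea, not the posted code: it assumes a Scrapy version that provides scrapy.crawler.CrawlerProcess, and the module-level list found_links and the LOG_LEVEL setting are just illustrative choices.

from scrapy import Spider
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor

found_links = []  # plain python list you can hand to urllib2/bs4 afterwards

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://webpage.com']

    def parse(self, response):
        # collect the absolute url of every extracted link instead of printing it
        for link in LinkExtractor().extract_links(response):
            found_links.append(link.url)

process = CrawlerProcess(settings={'LOG_LEVEL': 'WARNING'})
process.crawl(MySpider)
process.start()  # blocks until the crawl has finished

print(found_links)

Note that process.start() runs Scrapy's reactor, which can only be started once per script run.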
In order to store those links you can create so-called items, which can then be exported by a defined item pipeline. The whole thing is too complex to post in a single answer. The bottom line is: yes, scrapy is a strong tool you can use for link extraction, and the best place to start is the scrapy tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html
Luckily the scrapy documentation is great :)
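For a rough illustration of the item idea (again just a sketch, not the full pipeline setup from the tutorial): Scrapy accepts plain dicts as items, so the spider can yield one dict per link and the built-in feed export will write them to a file when you run scrapy crawl myspider -o links.json from inside the project.

from scrapy import Spider
from scrapy.linkextractors import LinkExtractor

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://webpage.com']

    def parse(self, response):
        # yield one item (a plain dict) per extracted link;
        # 'scrapy crawl myspider -o links.json' exports them as JSON
        for link in LinkExtractor().extract_links(response):
            yield {'url': link.url}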
Upvotes: 1