Reputation: 499
I found some answers on how to extract all available links from a website, and all of them were about the scrapy module. I also copied one of the code examples:
from scrapy import Spider
from scrapy.linkextractors import LinkExtractor

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://webpage.com']

    def parse(self, response):
        le = LinkExtractor()
        for link in le.extract_links(response):
            print(link)
But I need to launch it and get a simple python list of all html pages, so I can get some information from them using urllib2 and bs4.
How do I launch this class correctly to get this list?
Upvotes: 0
Views: 157
Reputation: 1801
scrapy is a great tool for scraping websites, but it is more than just the snippet you posted. What you posted is a spider definition. If it is embedded in a scrapy project, you can run this spider, e.g. in your terminal, with scrapy crawl myspider.
Then your spider will visit http://webpage.com and extract all the links on that page. Each url will be printed out, but that's all.
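If all you really need is a plain python list of urls to feed into urllib2 and bs4, you can also drive the spider from a plain script instead of printing. This is only a minimal sketch of that idea, not the posted code: it assumes a Scrapy version that provides scrapy.crawler.CrawlerProcess, and the module-level list found_links and the LOG_LEVEL setting are just illustrative choices.

from scrapy import Spider
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor

found_links = []  # plain python list you can hand to urllib2/bs4 afterwards

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://webpage.com']

    def parse(self, response):
        # collect the absolute url of every extracted link instead of printing it
        for link in LinkExtractor().extract_links(response):
            found_links.append(link.url)

process = CrawlerProcess(settings={'LOG_LEVEL': 'WARNING'})
process.crawl(MySpider)
process.start()  # blocks until the crawl has finished

print(found_links)

Note that process.start() runs Scrapy's reactor, which can only be started once per script run.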
In order to store those links you can create so-called items, which can then be exported by a defined item pipeline. The whole thing is too complex to post in a single answer. The bottom line is: yes, scrapy is a strong tool you can use for link extraction, and the best place to start is the scrapy tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html
Luckily the scrapy documentation is great :)
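For a rough illustration of the item idea (again just a sketch, not the full pipeline setup from the tutorial): Scrapy accepts plain dicts as items, so the spider can yield one dict per link and the built-in feed export will write them to a file when you run scrapy crawl myspider -o links.json from inside the project.

from scrapy import Spider
from scrapy.linkextractors import LinkExtractor

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://webpage.com']

    def parse(self, response):
        # yield one item (a plain dict) per extracted link;
        # 'scrapy crawl myspider -o links.json' exports them as JSON
        for link in LinkExtractor().extract_links(response):
            yield {'url': link.url}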
Upvotes: 1