How can I go to a link and get its sub-links, and then again get its sub-sub-links? For example, I want to go to
"https://stackoverflow.com"
then extract its links, e.g.
['https://stackoverflow.com/questions/ask', 'https://stackoverflow.com/?tab=bounties']
and then go to each of those sub-links and extract their links in turn.
Upvotes: 0
Views: 491
Reputation: 71
I would recommend using Scrapy for this. With Scrapy, you create a spider object which is then run by the Scrapy engine.
First, to get all the links on a page, you can create a Selector object and extract all of the hyperlink href attributes with an XPath expression:
hxs = scrapy.Selector(response)
urls = hxs.xpath('//a/@href').extract()
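If you want to try the extraction on its own, outside of a running spider, a Selector can also be built directly from an HTML string. A minimal sketch, using a made-up page fragment:

from scrapy import Selector

# hypothetical HTML fragment, used only to demonstrate the XPath
html = '<html><body><a href="/questions/ask">ask</a> <a href="https://example.com">ext</a></body></html>'
print(Selector(text=html).xpath('//a/@href').extract())
# ['/questions/ask', 'https://example.com']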
Since hxs.xpath(...).extract() returns a list of URL strings, you can iterate over it directly without storing it in a variable. Each URL found should also be passed back into this same method via the callback argument, so that the spider recursively finds all the links on each page it visits:
hxs = scrapy.Selector(response)
for url in hxs.xpath('//a/@href').extract():
    yield scrapy.http.Request(url=url, callback=self.parse)
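Note that Scrapy filters duplicate requests by default, which is what stops this recursion from looping forever on pages that link back to each other. The dont_filter argument on Request controls that behaviour per request; its default of False is exactly what you want here:

# the default duplicate filter drops requests for URLs already seen;
# dont_filter=False (the default) keeps that protection enabled
yield scrapy.http.Request(url=url, callback=self.parse, dont_filter=False)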
Each href found might be a relative path rather than an absolute URL, so it has to be made absolute first (root-relative paths such as /questions/ask already start with a slash, so the base URL should not end with one):
if not ( url.startswith('http://') or url.startswith('https://') ):
    url = "https://stackoverflow.com" + url
Finally, each URL can be passed to a different method to be processed; in this case it's just printed:
self.handle(url)
All of this put together into a full Spider looks like this:
import scrapy

class StackSpider(scrapy.Spider):
    name = "stackoverflow.com"
    # limit the scope to stackoverflow
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "https://stackoverflow.com/",
    ]

    def parse(self, response):
        hxs = scrapy.Selector(response)
        # extract all links from the page
        for url in hxs.xpath('//a/@href').extract():
            # make it a valid url
            if not ( url.startswith('http://') or url.startswith('https://') ):
                url = "https://stackoverflow.com" + url
            # process the url
            self.handle(url)
            # recursively parse each url
            yield scrapy.http.Request(url=url, callback=self.parse)

    def handle(self, url):
        print(url)
And the spider would be run like this:
$ scrapy runspider spider.py > urls.txt
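Since parse yields a request for every link it finds, the crawl can run for a very long time. Scrapy's DEPTH_LIMIT setting caps how many levels of links it will follow, which matches the links / sub-links / sub-sub-links shape of the question:

$ scrapy runspider spider.py -s DEPTH_LIMIT=2 > urls.txt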
Also, keep in mind that running this code will get you rate limited by Stack Overflow. You might want to find a different target for testing, ideally a site that you're hosting yourself.
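To be gentler on whatever site you do crawl, Scrapy also ships an AutoThrottle extension and a per-spider download delay. A minimal sketch, added as custom_settings on the spider above:

class StackSpider(scrapy.Spider):
    # ... name, allowed_domains, start_urls as above ...
    custom_settings = {
        "AUTOTHROTTLE_ENABLED": True,  # adapt request rate to observed latency
        "DOWNLOAD_DELAY": 1.0,         # baseline delay between requests, in seconds
    }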
Upvotes: 1