user13415013

How to get links inside links from a webpage in Python?

How can I go to a link, get its sub-links, and then get the sub-links of those? For example,

I want to go to

"https://stackoverflow.com"

then extract its links, e.g.

['https://stackoverflow.com/questions/ask', 'https://stackoverflow.com/?tab=bounties']

and then go to each of those sub-links and extract their links in turn.

Upvotes: 0

Views: 491

Answers (1)

Samiser

Reputation: 71

I would recommend using Scrapy for this. With Scrapy, you create a spider class, and the framework then runs it for you.
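
For context, a spider is just a class with a name, some start URLs, and a parse method that Scrapy calls with each downloaded response. A minimal sketch (the class and spider name below are arbitrary placeholders):

import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"                               # name used when running the spider
    start_urls = ["https://stackoverflow.com/"]  # Scrapy downloads these and calls parse()

    def parse(self, response):
        # link extraction will go here
        pass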

First, to get all the links on a page, you can create a Selector object and find all of the hyperlink elements with an XPath expression:

hxs = scrapy.Selector(response)
urls = hxs.xpath('*//a/@href').extract()

Since hxs.xpath returns a list, you can iterate over it directly without storing it in a variable. Each URL found should also be passed back into this same function via the callback argument, so that the spider recursively finds all the links within each page it visits:

hxs = scrapy.Selector(response)
for url in hxs.xpath('*//a/@href').extract():
    yield scrapy.http.Request(url=url, callback=self.parse)

Each extracted href may be a relative path rather than an absolute URL, so that has to be checked and, if needed, the base URL prepended:

    if not ( url.startswith('http://') or url.startswith('https://') ):
        url = "https://stackoverflow.com/" + url

Finally, each URL can be passed to a separate function for processing; in this case it's just printed:

    self.handle(url)

Putting all of this together into a full Spider object looks like this:

import scrapy

class StackSpider(scrapy.Spider):
    name = "stackoverflow.com"
    # limit the scope to stackoverflow
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "https://stackoverflow.com/",
    ]

    def parse(self, response):
        hxs = scrapy.Selector(response)
        # extract all links from page
        for url in hxs.xpath('*//a/@href').extract():
            # make it a valid url
            if not ( url.startswith('http://') or url.startswith('https://') ):
                url = "https://stackoverflow.com/" + url
            # process the url
            self.handle(url)
            # recursively parse each url
            yield scrapy.http.Request(url=url, callback=self.parse)

    def handle(self, url):
        print(url)

And the spider would be run like this:

$ scrapy runspider spider.py > urls.txt
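
As a side note, reasonably recent Scrapy versions (1.4+) also provide response.urljoin and response.follow, which resolve relative hrefs for you, so the manual "https://stackoverflow.com/" prefix isn't strictly needed. A sketch of the same spider written that way (the class and spider name are placeholders):

import scrapy

class StackFollowSpider(scrapy.Spider):
    name = "stackoverflow-follow"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["https://stackoverflow.com/"]

    def parse(self, response):
        for href in response.xpath('*//a/@href').extract():
            # urljoin turns relative hrefs like /questions/ask into absolute URLs
            print(response.urljoin(href))
            # response.follow accepts relative URLs and schedules a new request
            yield response.follow(href, callback=self.parse)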

Also, keep in mind that running this code against Stack Overflow will quickly get you rate-limited. You might want to find a different target for testing, ideally a site that you host yourself.
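
If you do point it at a live site, Scrapy's built-in settings can at least keep the crawl polite and bounded. A sketch of settings that could go in settings.py or in the spider's custom_settings (the values are only examples to tune for the target site):

ROBOTSTXT_OBEY = True                # respect the site's robots.txt
DOWNLOAD_DELAY = 1.0                 # pause between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # limit parallel requests per domain
AUTOTHROTTLE_ENABLED = True          # back off automatically when responses slow down
DEPTH_LIMIT = 2                      # stop recursing after two levels of links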

Upvotes: 1
