Alex

Reputation: 44265

How to recursively find all links from a webpage with BeautifulSoup?

I have been trying to use some code I found in this answer to recursively find all links from a given URL:

import urllib2
from bs4 import BeautifulSoup

url = "http://francaisauthentique.libsyn.com/"

def recursiveUrl(url,depth):

    if depth == 5:
        return url
    else:
        page=urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        newlink = soup.find('a') #find just the first one
        if len(newlink) == 0:
            return url
        else:
            return url, recursiveUrl(newlink,depth+1)


def getLinks(url):
    page=urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    links = soup.find_all('a')
    for link in links:
        links.append(recursiveUrl(link,0))
    return links

links = getLinks(url)
print(links)

Besides the following warning:

/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 28 of the file downloader.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

I get the following error:

Traceback (most recent call last):
  File "downloader.py", line 28, in <module>
    links = getLinks(url)
  File "downloader.py", line 25, in getLinks
    links.append(recursiveUrl(link,0))
  File "downloader.py", line 11, in recursiveUrl
    page=urllib2.urlopen(url)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 396, in open
    protocol = req.get_type()
TypeError: 'NoneType' object is not callable

What is the problem?

Upvotes: 3

Views: 8375

Answers (2)

Vipul Sharma

Reputation: 798

This code recursively follows every link and appends each full URL to a list. The final output is a list of all the URLs visited.

import requests
from bs4 import BeautifulSoup

listUrl = []

def recursiveUrl(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.find_all('a')
    if not links:  # find_all returns [] when the page has no anchors
        listUrl.append(url)
        print(url)
        return
    else:
        listUrl.append(url)
        print(url)
        for link in links:
            #print(url + link['href'][1:])
            recursiveUrl(url + link['href'][1:])


recursiveUrl('http://target.com')
print(listUrl)
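A caveat with the approach above: `url + link['href'][1:]` only works when every href is site-relative and begins with a slash. The standard-library `urllib.parse.urljoin` (Python 3) handles site-relative, page-relative, and absolute hrefs alike; a small sketch of the difference:

```python
from urllib.parse import urljoin

base = 'http://francaisauthentique.libsyn.com/'

# site-relative href: resolved against the host, no doubled slash
print(urljoin(base, '/webpage/category/general'))
# -> http://francaisauthentique.libsyn.com/webpage/category/general

# page-relative href: resolved against the current page
print(urljoin(base, 'webpage/2017'))
# -> http://francaisauthentique.libsyn.com/webpage/2017

# absolute href: returned unchanged
print(urljoin(base, 'http://example.com/page'))
# -> http://example.com/page
```

Replacing the string concatenation with `recursiveUrl(urljoin(url, link['href']))` would also avoid the doubled slashes visible in the other answer's output.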

Upvotes: 0

Ali

Reputation: 1357

Your recursiveUrl tries to open an invalid URL like /webpage/category/general, which is the value you extracted from one of the href links.

You should append the extracted href value to the website's base URL and then open the resulting page. You will also need to rework your recursion algorithm, as it is not clear what you want to achieve.

Code:

import requests
from bs4 import BeautifulSoup

def recursiveUrl(url, link, depth):
    if depth == 5:
        return url
    else:
        print(link['href'])
        page = requests.get(url + link['href'])
        soup = BeautifulSoup(page.text, 'html.parser')
        newlink = soup.find('a')  # just the first anchor
        if newlink is None:  # no anchors on this page
            return link
        else:
            return link, recursiveUrl(url, newlink, depth + 1)

def getLinks(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.find_all('a')
    for link in links:
        links.append(recursiveUrl(url, link, 0))
    return links

links = getLinks("http://francaisauthentique.libsyn.com/")
print(links)

Output:

http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/10
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/09
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/08
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/07
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
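Separate from the relative-URL fix, note that both versions of getLinks append to `links` inside `for link in links:`, so the loop also visits every element it appends and, on a page with any links at all, effectively never reaches the end of the list. A minimal stdlib-only illustration of the pattern (toy data, not the crawler itself):

```python
# Appending to a list while iterating over it makes the iterator
# visit the new items too, so the loop keeps going:
links = ['a', 'b']
steps = 0
for link in links:
    steps += 1
    if steps >= 10:          # cut the runaway loop short
        break
    links.append(link + '*')
print(steps)                  # 10: still running after 10 items from a 2-item list

# Iterating over a snapshot keeps the loop bounded:
links = ['a', 'b']
results = []
for link in list(links):      # copy: appends no longer feed the loop
    results.append(link + '*')
print(links + results)        # ['a', 'b', 'a*', 'b*']
```

Collecting results in a separate list (or iterating over `list(links)`) is the usual fix; a `visited` set on top of that would also stop the crawler from fetching the same category page over and over, as the output above shows it doing.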

Upvotes: 3
