user8619267

A link checker in Python

I have to create a Python function that finds the links in an HTML page and then, for every link, checks whether it is broken or not.

First, I created a function that finds every link (broken or not) without a problem and stores them in an array called "g".

Second, I planned to build an array "a" containing only the broken links by checking every element of "g".

I don't know why, but checking the links one by one doesn't work inside the function. If I copy a link and pass it as a parameter to the function in the shell, it works.

import urllib.request
import re
import urllib.error

def check(url):
    try:
        f= urllib.request.urlopen(url)
    except urllib.error.HTTPError:
        return False
    return True
def trouveurdurls(url):
    f= urllib.request.urlopen(url)
    b=(f.read().decode('utf-8'))
    d=re.findall('<a href=".*</a>',b)
    liste=[]
    f.close()
    for el in d:
        g=re.findall('"https?://.*\.*"',el)
        liste.append(g)
    a=[]
    for el in liste:
        chaine= ''.join(map(str,el))
        url=chaine.replace('\'','')
        print(url)
        #check(url)


b=trouveurdurls("http://127.0.0.1/")
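For reference, a minimal corrected sketch of the two functions (keeping the original names; the regex now captures only the quoted href value non-greedily, so each entry in liste is a plain URL string and check() receives it with no stray quotes — this is one possible fix, not necessarily the only one):

```python
import re
import urllib.request
import urllib.error

def check(url):
    """Return True if the URL can be fetched, False if it is broken."""
    try:
        urllib.request.urlopen(url)
    except urllib.error.URLError:  # HTTPError is a subclass; also catches unreachable hosts
        return False
    return True

def trouveurdurls(url):
    """Return the list of broken http(s) links found on the page."""
    with urllib.request.urlopen(url) as f:
        b = f.read().decode('utf-8')
    # Capture only the quoted href value; non-greedy, one URL per match.
    liste = re.findall(r'<a href="(https?://.*?)"', b)
    a = []
    for lien in liste:
        if not check(lien):
            a.append(lien)
    return a
```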

Upvotes: 0

Views: 2299

Answers (2)

Evan Nowak

Reputation: 895

Here's a version that uses requests. It scrapes the page for URLs, then checks each URL's response code (200 means it worked, and so on).

import requests
from bs4 import BeautifulSoup

page = 'http://the.web.site'

r = requests.get(page)
data = r.text
soup = BeautifulSoup(data, 'lxml')

urls = []
for a in soup.find_all('a'):
    urls.append(a.get('href'))

url_responses = {}
for url in urls:
    temp_r = requests.get(url)
    url_responses[url] = temp_r.status_code
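One caveat worth adding to this approach: href values are often relative paths, empty, or non-HTTP schemes like mailto:, and requests.get(url) will fail on those. Resolving each href against the page URL first avoids that. A sketch of the resolution step using only the standard library (the class and function names here are my own, not from the answer):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs from every <a href=...> tag."""
    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:                                # skip <a> tags without an href
                url = urljoin(self.page_url, href)  # resolve relative paths
                if url.startswith(('http://', 'https://')):
                    self.links.append(url)          # drop mailto:, javascript:, ...

def absolute_links(page_url, html):
    parser = LinkCollector(page_url)
    parser.feed(html)
    return parser.links
```

The resulting list can then be fed to the status-code loop above in place of the raw href values.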

Upvotes: 0

Pascal

Reputation: 468

I'd be very surprised if the line

d=re.findall('<a href=".*</a>',b)

worked correctly. The asterisk does a greedy match, meaning it finds the longest possible match. If you have more than a single link in your document, this regex will gobble up everything from the first link to the end of the last one.

You should probably use the non-greedy version of the asterisk, i.e. .*? instead of .*.

Also, you're only interested in the href attribute, so

d = re.findall('<a href=".*?"', b)

is probably what you want. It will also make the following code simpler; you won't need to search for URLs starting with http:// because you'll already have them in the d array. Your code as it is would additionally find URLs that weren't part of an <a> tag's href attribute but just part of the normal text, part of an image reference, and so on.

Note that even the regex <a href=".*?" won't reliably match every link, because HTML is very lax about its syntax, so you might also encounter strings like <A HREF='.....'>, <a href = "....., <a id="something" href="......">, and so on.

You'll have to either improve the regex to cover all such cases, or turn to BeautifulSoup (as in Evan's answer) or another full-fledged HTML parser that takes care of this for you.
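The greedy vs. non-greedy difference is easy to demonstrate on a two-link snippet:

```python
import re

html = '<a href="http://a.example/">one</a> and <a href="http://b.example/">two</a>'

# Greedy: .* runs to the last closing </a>, swallowing both links in one match.
print(re.findall('<a href=".*</a>', html))
# → ['<a href="http://a.example/">one</a> and <a href="http://b.example/">two</a>']

# Non-greedy, capturing only the href value: one clean URL per match.
print(re.findall('<a href="(.*?)"', html))
# → ['http://a.example/', 'http://b.example/']
```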

Upvotes: 1
