How to properly manipulate relative URLs in Python?

Question

I am working on a web crawler these days. When my crawler gathers the links on a site, some of the URLs look like: about.html , /pages , #form-login , javascript:validate(); , ../help , ../../ , ./ .

I have tried urllib's urlparse and urljoin and the os module's join functions. Below is the part of my project's code that is related to the question.


from urllib.parse import urlparse, urljoin

base_url = input('Enter base url : ')


def make_links(link):
    u = urlparse(link)
    if link[:3] == 'www':
        link = u['scheme'] + link
    elif link[:1] == '/':
        link = base_url + link
    elif link[:3] == '../':
        link = urljoin(base_url, link)
    elif link[:2] == './':
        link = urljoin(base_url, link)
        link = base_url + '/' + link
    print(link)


while True:
    i = input("Enter your url : ")
    if i == 'exit':
        break
    else:
        make_links(i)

I expect the relative URL entered by the user to be resolved against the base URL entered by the user. When the user enters an absolute URL as base_url and then enters a relative URL, the output should be the absolute URL at which the page can be opened in a browser. The program should also support any type of relative URL. For the ways relative URLs can be written, refer to this:

http://webreference.com/html/tutorial2/3.html

It should not execute javascript when the program comes across URLs like javascript:alert('foo-bar') 😜 😜 😜

Derlin · Accepted Answer

urljoin does most of the heavy lifting for you. Hence, something as simple as this would do the trick:

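from urllib.parse import urljoin, urlparse

def make_links(link):
    # base_url is the global read from input() in the question's code
    url = urljoin(base_url, link)
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.scheme.startswith('http'):
        # not an http(s) link, e.g. javascript:, mailto:, etc.
        return None
    return url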

Example:

Enter base url : http://example.com/dir1/file.php
Enter your url : ../dir2
http://example.com/dir2
Enter your url : #hello
http://example.com/dir1/file.php#hello
Enter your url : javascript: return false
None
Enter your url : /world
http://example.com/world
Enter your url : www.test.com
http://example.com/dir1/www.test.com
Enter your url : http://www.test.com
http://www.test.com

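As you can see, the only downside is that an absolute URL must start with its scheme (http:// or https://); without it, as with www.test.com above, the link is treated as a relative path. This actually makes sense, as there are no strict rules: a website could have a resource whose name starts with www...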
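
If you would rather not depend on a global base_url, here is a minimal sketch of the same idea as a self-contained function. The names resolve_link, ALLOWED_SCHEMES and the www_is_host flag are my own, not part of urllib. The flag is only a heuristic for crawlers that want to treat scheme-less "www." links as hosts; it is off by default because, as said above, nothing forbids a site from having a path that starts with www.

```python
from urllib.parse import urljoin, urlparse

# Assumption: the crawler only wants links it can fetch over HTTP(S).
ALLOWED_SCHEMES = {"http", "https"}


def resolve_link(base_url, link, www_is_host=False):
    """Return an absolute http(s) URL for link, or None if it is not crawlable."""
    link = link.strip()
    if www_is_host and link.lower().startswith("www."):
        # Heuristic, not a rule: turn "www.test.com" into the network-path
        # reference "//www.test.com" so urljoin borrows the base URL's scheme.
        link = "//" + link
    url = urljoin(base_url, link)
    # urlparse lowercases the scheme, so this also rejects javascript:, mailto:, ...
    if urlparse(url).scheme not in ALLOWED_SCHEMES:
        return None
    return url


if __name__ == "__main__":
    base = "http://example.com/dir1/file.php"
    for link in ["about.html", "/pages", "#form-login",
                 "javascript:validate();", "../help", "./"]:
        print(link, "->", resolve_link(base, link))
```

The link is only ever parsed and compared as a string, so a javascript: URL is never executed; it just gets filtered out and the function returns None.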
