Reputation: 416
I am working on a web crawler these days. When my crawler gathers the links on a site, some of the URLs look like this:
about.html
/pages
#form-login
javascript:validate();
../help
../../
./
I have tried urllib's urlparse and urljoin and the os module's join functions. The part of my project's code that is relevant to the question is given below.
from urllib.parse import urlparse, urljoin

base_url = input('Enter base url : ')

def make_links(link):
    u = urlparse(link)
    if link[:3] == 'www':
        link = u['scheme'] + link
    elif link[:1] == '/':
        link = base_url + link
    elif link[:3] == '../':
        link = urljoin(base_url, link)
    elif link[:2] == './':
        link = urljoin(base_url, link)
    link = base_url + '/' + link
    print(link)

while True:
    i = input("Enter your url : ")
    if i == 'exit':
        break
    else:
        make_links(i)
I expect the output to be the relative URL entered by the user, resolved against the base URL entered by the user. When the user enters an absolute URL as base_url and then enters a relative URL, the output should be the absolute URL at which the page can be opened in a browser. The program should also support every type of relative URL. For the ways relative URLs can be written, see:
http://webreference.com/html/tutorial2/3.html
It should not execute javascript when the program comes across URLs like
javascript:alert('foo-bar')
😜 😜 😜
Upvotes: 2
Views: 385
Reputation: 9881
urljoin does most of the heavy lifting for you. Hence, something as simple as this would do the trick:
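from urllib.parse import urljoin, urlparse

# base_url is the global read from input() in the question's code
def make_links(link):
    url = urljoin(base_url, link)
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.scheme.startswith('http'):
        # invalid, e.g. javascript, etc.
        return None
    return url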
Example:
Enter base url : http://example.com/dir1/file.php
Enter your url : ../dir2
http://example.com/dir2
Enter your url : #hello
http://example.com/dir1/file.php#hello
Enter your url : javascript: return false
None
Enter your url : /world
http://example.com/world
Enter your url : www.test.com
http://example.com/dir1/www.test.com
Enter your url : http://www.test.com
http://www.test.com
As you can see, the only downside is that a link to another site has to start with http, otherwise it is treated as a relative path. This actually makes sense, as there are no strict rules: a website could use www as a subresource...
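If your crawler does want to treat links such as www.test.com as host names, one option is to rewrite them as scheme-relative URLs before joining. Below is a minimal sketch of that idea, not a definitive implementation. The function name resolve_link and the assume_www_is_host flag are made up for illustration, and the "starts with www." test is only a heuristic that will misfire on a site that really has a path called www.something.

```python
from urllib.parse import urljoin, urlparse


def resolve_link(base_url, link, assume_www_is_host=False):
    """Return an absolute http(s) URL for link, or None if it is not crawlable."""
    link = link.strip()
    # Hypothetical heuristic: treat a bare "www." prefix as a host name.
    if assume_www_is_host and link.lower().startswith("www."):
        link = "//" + link  # scheme-relative, so it inherits the base scheme
    url = urljoin(base_url, link)
    # Keep only http and https; javascript:, mailto:, etc. are dropped
    # without ever being executed or fetched.
    if urlparse(url).scheme not in ("http", "https"):
        return None
    return url


base = "http://example.com/dir1/file.php"
for link in ["about.html", "/pages", "#form-login", "javascript:validate();",
             "../help", "./", "www.test.com"]:
    print(repr(link), "->", resolve_link(base, link, assume_www_is_host=True))
```

Passing the base URL as an argument instead of reading a global also lets the crawler resolve each link against the page it was found on, which is what relative links are relative to.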
Upvotes: 1