PrivateUser

Reputation: 4524

How can I prepend http to a url if it doesn't begin with http?

I have urls formatted as:

google.com
www.google.com
http://google.com
http://www.google.com

I would like to convert all of these to a uniform format that starts with http://

http://google.com

How can I prepend http:// to URLs using Python?

Upvotes: 29

Views: 16682

Answers (6)

racitup

Reputation: 464

If you're certain the url begins with the domain, shouldn't the answer be:

from urllib.parse import urlparse

def add_scheme(url):
    parts = urlparse(url, "http")
    if not parts.netloc:
        parts = urlparse("//" + url, "http")
    return parts.geturl()

Upvotes: 0

David Wilkinson

Reputation: 9

If your URLs are strings, you could just concatenate.

one = "https://"
two = "www.privateproperty.co.za"

link = "".join((one, two))
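Plain concatenation would double the scheme on inputs like http://google.com from the question, so it probably needs a guard. A minimal sketch, assuming only http:// and https:// count as "already has a scheme"; the function name prepend_scheme and its default argument are made up for illustration:

```python
def prepend_scheme(url, scheme="http://"):
    # str.startswith accepts a tuple, so both schemes are checked in one call.
    if url.startswith(("http://", "https://")):
        return url
    # Assumption: anything else is a bare host (optionally with a path).
    return scheme + url


for u in ["google.com", "www.google.com", "http://google.com"]:
    print(prepend_scheme(u))
```

This only looks at the start of the string; it does not validate the URL in any way.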

Upvotes: 0

cider

Reputation: 407

from urllib.parse import SplitResult, urlsplit, urlunsplit

def fix_url(orig_link):
    # default to https when the link has no scheme
    split_comps = urlsplit(orig_link, scheme='https')
    # fix netloc (it ends up in path when there is no scheme)
    if not split_comps.netloc:
        if split_comps.path:
            # override components with fixed netloc and path
            split_comps = SplitResult(scheme='https', netloc=split_comps.path, path='',
                                      query=split_comps.query, fragment=split_comps.fragment)
        else:  # no netloc, no path
            raise ValueError('URL has neither a host nor a path')
    return urlunsplit(split_comps)

Upvotes: 0

Rehmat

Reputation: 5071

I found it easy to detect the protocol with a regex and then prepend it if missing:

import re
def formaturl(url):
    if not re.match('(?:http|ftp|https)://', url):
        return 'http://{}'.format(url)
    return url

url = 'test.com'
print(formaturl(url)) # http://test.com

url = 'https://test.com'
print(formaturl(url)) # https://test.com

I hope it helps!

Upvotes: 12

JBernardo

Reputation: 33397

Python does have built-in functions to handle that correctly, for example:

import urlparse  # Python 2 module; see the PS below for Python 3
p = urlparse.urlparse(my_url, 'http')
netloc = p.netloc or p.path
path = p.path if p.netloc else ''
if not netloc.startswith('www.'):
    netloc = 'www.' + netloc

p = urlparse.ParseResult('http', netloc, path, *p[3:])
print(p.geturl())

If you want to remove (or add) the www part, you have to change the .netloc field of the result before calling .geturl().

Because ParseResult is a namedtuple, you cannot edit it in place; you have to create a new object.

PS:

For Python3, it should be urllib.parse.urlparse
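A possible Python 3 version of the same idea, as a rough sketch rather than a drop-in replacement. It uses the namedtuple method _replace to build the new object, and unlike the snippet above it leaves the www part alone. The function name with_http is an assumption for illustration:

```python
from urllib.parse import urlparse


def with_http(url):
    # 'http' is only used as the default when the URL carries no scheme.
    p = urlparse(url, 'http')
    if not p.netloc:
        # Without a scheme, urlparse puts the host into .path,
        # so split the host off from the rest of the path.
        host, sep, rest = p.path.partition('/')
        # ParseResult is a namedtuple: _replace returns a new object.
        p = p._replace(netloc=host, path=sep + rest)
    return p.geturl()


for u in ['google.com', 'www.google.com', 'http://google.com', 'google.com/a/b']:
    print(with_http(u))
```

Inputs that already carry a scheme and host pass through unchanged.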

Upvotes: 23

barak manos

Reputation: 30136

For the formats that you mention in your question, you can do something as simple as:

def convert(url):
    if url.startswith('http://www.'):
        return 'http://' + url[len('http://www.'):]
    if url.startswith('www.'):
        return 'http://' + url[len('www.'):]
    if not url.startswith('http://'):
        return 'http://' + url
    return url

But please note that there are probably other formats that you are not anticipating. In addition, keep in mind that the output URL (according to your definitions) will not necessarily be a valid one (i.e., the DNS will not be able to translate it into a valid IP address).

Upvotes: 6
