guri

Reputation: 1701

Removing http and www from a URL in Python

url1='www.google.com'
url2='http://www.google.com'
url3='http://google.com'
url4='www.google'
url5='http://www.google.com/images'
url6='https://www.youtube.com/watch?v=6RB89BOxaYY'

How can I strip http(s):// and www. from a URL in Python?

Upvotes: 19

Views: 28724

Answers (4)

Limbail

Reputation: 1

This removes http:// or https:// if present, and then www. if present:

url = url.replace('http://', '')
url = url.replace('https://', '')
url = url.replace('www.', '')
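Because replace removes every occurrence of the substring, a URL whose path happens to contain www. would be altered as well. A minimal sketch that only touches the start of the string, assuming Python 3.9+ for str.removeprefix (the function name strip_scheme_and_www and the example.com URL are made up for illustration):

```python
def strip_scheme_and_www(url):
    """Drop a leading http:// or https://, then a leading www. (hypothetical helper)."""
    for scheme in ("https://", "http://"):
        url = url.removeprefix(scheme)  # no-op when the prefix is absent
    return url.removeprefix("www.")


print(strip_scheme_and_www("http://www.google.com/images"))  # google.com/images
print(strip_scheme_and_www("http://example.com/www.html"))   # example.com/www.html
```

With chained replace calls, the second URL would come out as example.com/html, because the www. inside the path is removed too.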

Upvotes: -1

Januka samaranyake

Reputation: 2597

You can use the string method replace:

url = 'http://www.google.com/images'
url = url.replace("http://www.","")

or you can use regular expressions:

import re

# Match a leading http:// or https://, optionally followed by www.
pattern = re.compile(r"https?://(www\.)?")
url = pattern.sub('', 'http://www.google.com/images').strip().strip('/')
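The pattern above requires a scheme, so it leaves www.google.com and www.google unchanged. A sketch of an anchored variant that makes both parts optional, checked against the six URLs from the question (the names PREFIX and strip_prefix are illustrative, not from the answer):

```python
import re

# ^ anchors the match at the start; the scheme and www. are each optional.
PREFIX = re.compile(r"^(?:https?://)?(?:www\.)?")


def strip_prefix(url):
    """Remove a leading http(s):// and/or www. from url (hypothetical helper)."""
    return PREFIX.sub("", url, count=1)


urls = [
    "www.google.com",
    "http://www.google.com",
    "http://google.com",
    "www.google",
    "http://www.google.com/images",
    "https://www.youtube.com/watch?v=6RB89BOxaYY",
]
cleaned = [strip_prefix(u) for u in urls]
for before, after in zip(urls, cleaned):
    print(before, "->", after)
```

Anchoring with ^ keeps the substitution from touching an http:// or www. that appears later in the string, for example inside a query parameter.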

Upvotes: 31

WJA

Reputation: 7004

A more elegant solution is to use urlparse:

from urllib.parse import urlparse

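def get_hostname(url, uri_type='both'):
    """Return the scheme and host of a URL, or only the host."""
    parsed_uri = urlparse(url)
    if uri_type == 'both':
        return '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    elif uri_type == 'netloc_only':
        return parsed_uri.netloc
    # Fail loudly instead of silently returning None for an unknown option
    raise ValueError("uri_type must be 'both' or 'netloc_only'")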

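With uri_type='both' the result keeps the scheme (http or https, whichever the link uses). With uri_type='netloc_only' it returns only the netloc, which is the host name you were looking for. Note that the netloc still includes www. when the URL has it.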
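One caveat: urlparse only fills in netloc when the string contains //, so for scheme-less inputs such as www.google.com from the question, netloc is empty and the whole string lands in path. A sketch of one way around that, assuming it is acceptable to prepend // when no scheme is present (bare_host is a hypothetical helper, and the '://' check is a simplification):

```python
from urllib.parse import urlparse


def bare_host(url):
    """Return the host of url without a leading www. (hypothetical helper)."""
    if "://" not in url:
        # Assumption: a string without a scheme starts with the host name.
        url = "//" + url
    host = urlparse(url).netloc
    return host[len("www."):] if host.startswith("www.") else host


print(urlparse("www.google.com").netloc == "")  # True: no // means no netloc
print(bare_host("www.google.com"))
print(bare_host("http://www.google.com/images"))
```

Unlike the string-based answers, this returns only the host, so a path such as /images is dropped.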

Upvotes: 9

Tristan Bodding-Long

Reputation: 280

You could use a regex, depending on how consistent your data is. Are http and www always going to be there? Have you thought about https or w3 sites?

import re
new_url = re.sub(r'.*w\.', '', url, count=1)

The count of 1 limits the substitution to a single match, the intent being not to harm websites whose name ends with a w.
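Since .* is greedy, this pattern removes everything up to the last w. in the string, so a host or path containing another w. gets cut short even with a count of 1. A small check using a made-up URL, alongside an anchored pattern that only removes a leading prefix:

```python
import re

url = "http://www.show.com"  # made-up example with a second "w." in the host

# Greedy: .* runs to the last "w." it can find, which is the one in "show."
greedy = re.sub(r".*w\.", "", url, count=1)

# Anchored: only an optional scheme followed by www. at the very start
anchored = re.sub(r"^(?:https?://)?www\.", "", url, count=1)

print(greedy)    # com
print(anchored)  # show.com
```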

Edit after clarification:

I'd do two steps:

# Strip the scheme first, then a leading www.
if url.startswith('http'):
    url = re.sub(r'^https?://', '', url)
if url.startswith('www.'):
    url = re.sub(r'^www\.', '', url)

Upvotes: 1
