Reputation: 1701
url1='www.google.com'
url2='http://www.google.com'
url3='http://google.com'
url4='www.google'
url5='http://www.google.com/images'
url6='https://www.youtube.com/watch?v=6RB89BOxaYY'
How to strip http(s) and www from a URL in Python?
Upvotes: 19
Views: 28724
Reputation: 1
This removes http:// or https:// if present, and then www. if present:
url=url.replace('http://','')
url=url.replace('https://','')
url=url.replace('www.','')
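Note that replace removes matches anywhere in the string, not only at the start. A minimal sketch of a prefix-only variant, assuming Python 3.9+ for str.removeprefix (strip_prefixes is a hypothetical helper name, not something from the question):

```python
def strip_prefixes(url):
    """Remove a leading scheme and a leading 'www.' only (hypothetical helper)."""
    # Order matters: the scheme has to go before 'www.' can be a prefix.
    for prefix in ('http://', 'https://', 'www.'):
        url = url.removeprefix(prefix)  # no-op when the prefix is absent
    return url


print(strip_prefixes('https://www.youtube.com/watch?v=6RB89BOxaYY'))
# A URL embedded later in the string is left untouched:
print(strip_prefixes('http://example.com/?next=http://www.a.com'))
```

With the chained replace calls, the second input would also lose the http:// and www. inside the query string.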
Upvotes: -1
Reputation: 2597
You can use the string method replace:
url = 'http://www.google.com/images'
url = url.replace("http://www.","")
or you can use regular expressions:
import re
pattern = re.compile(r"https?://(www\.)?")
url = pattern.sub('', 'http://www.google.com/images').strip().strip('/')
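The pattern above requires a scheme, so an input like 'www.google.com' is left unchanged. A possible variant, sketched here under the assumption that both the scheme and www. should be optional and matched only at the start of the string (PREFIX and strip_url are illustrative names):

```python
import re

# Optional scheme followed by optional 'www.', anchored at the start.
PREFIX = re.compile(r'^(https?://)?(www\.)?')


def strip_url(url):
    """Drop a leading http(s):// and www. if present (hypothetical helper)."""
    return PREFIX.sub('', url)


samples = [
    'www.google.com',
    'http://www.google.com',
    'http://google.com',
    'www.google',
    'http://www.google.com/images',
    'https://www.youtube.com/watch?v=6RB89BOxaYY',
]
for sample in samples:
    print(sample, '->', strip_url(sample))
```

Because of the ^ anchor, a www. that appears later in the path or query string is not touched.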
Upvotes: 31
Reputation: 7004
A more elegant solution would be using urlparse:
from urllib.parse import urlparse
def get_hostname(url, uri_type='both'):
    """Get the host name from the url"""
    parsed_uri = urlparse(url)
    if uri_type == 'both':
        return '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    elif uri_type == 'netloc_only':
        return '{uri.netloc}'.format(uri=parsed_uri)
The first option includes the scheme (https or http, depending on the link); the second returns only netloc, which is the host part you were looking for (it keeps any leading www.).
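One caveat: urlparse only fills netloc when the URL has a scheme or starts with '//', so a scheme-less input such as 'www.google.com' gives an empty netloc. A rough sketch that works around this, assuming only the host is wanted and the path can be discarded (bare_host is a hypothetical name):

```python
from urllib.parse import urlparse


def bare_host(url):
    """Return the host without a leading 'www.' (hypothetical helper)."""
    # Assumption: an input without '://' is a scheme-less URL, so prefix
    # '//' to make urlparse treat the first segment as the netloc.
    if '://' not in url:
        url = '//' + url
    host = urlparse(url).netloc
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host


print(bare_host('www.google.com'))
print(bare_host('http://www.google.com/images'))
```

Unlike the string-based approaches, this drops the path and query, so it fits when only the domain matters.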
Upvotes: 9
Reputation: 280
Could use regex, depending on how strict your data is. Are http and www always going to be there? Have you thought about https or w3 sites?
import re
new_url = re.sub(r'.*w\.', '', url, count=1)
The count of 1 limits the substitution to a single match, which is meant to avoid harming websites whose name ends with a w.
Edit after clarification:
I'd do two steps:
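if url.startswith('http'):
    # strip a leading http:// or https://
    url = re.sub(r'^https?://', '', url)
if url.startswith('www.'):
    # escape the dot and anchor at the start so only the prefix is removed
    url = re.sub(r'^www\.', '', url)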
Upvotes: 1