Reputation: 5863
I have a bunch of (ugly, if I may say so) URLs, which I would like to clean up using Python regex. My URLs look something like this:
http://www.thisislink1.com/this/is/sublink1/1
http://www.thisislink2.co.uk/this/is/sublink1s/klinks
http://www.thisislinkd.co/this/is/sublink1/hotlinks/2
http://www.thisislinkf.com.uk/this/is/sublink1d/morelink
http://www.thisislink1.co.in/this/is/sublink1c/mylink
....
What I'd like to do is clean up these URLs, so that the final links look like:
http://www.thisislink1.com
http://www.thisislink2.co.uk
http://www.thisislinkd.co
http://www.thisislinkf.de
http://www.thisislink1.us
....
and I was wondering how I can achieve this in a Pythonic way. Sorry if this is a 101 question; I am new to Python regex structures.
Upvotes: 0
Views: 286
Reputation: 85795
You should use a URL parser, as others have suggested, but for completeness here is a solution with regex:
import re
url = 'http://www.thisislink1.com/this/is/sublink1/1'
# Raw string keeps the pattern free of accidental escape sequences
re.sub(r'(?<![/:])/.*', '', url)
# returns 'http://www.thisislink1.com'
Explanation:
Match everything after and including the first forward slash that is not preceded by a : or /, and replace it with nothing ('').
(?<![/:]) # Negative lookbehind for '/' or ':'
/.* # Match a / followed by anything
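To apply the same pattern to a whole list of links, something along these lines should work. This is only a sketch: the helper name strip_path and the compiled PATH_RE are made up for illustration, and the pattern assumes every URL has a "://" after the scheme. A URL with a query string but no path (for example one ending in "?a=1" directly after the host) would be left untouched.

```python
import re

# Same pattern as above: a "/" that is not preceded by "/" or ":",
# followed by the rest of the string.
PATH_RE = re.compile(r'(?<![/:])/.*')


def strip_path(url):
    """Drop everything from the first path slash onward (hypothetical helper)."""
    return PATH_RE.sub('', url)


urls = [
    'http://www.thisislink1.com/this/is/sublink1/1',
    'http://www.thisislink2.co.uk/this/is/sublink1s/klinks',
    'http://www.thisislinkd.co/this/is/sublink1/hotlinks/2',
]

cleaned = [strip_path(u) for u in urls]
for link in cleaned:
    print(link)
```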
Upvotes: 1
Reputation: 879591
Use urlparse.urlsplit:
In [3]: import urlparse
In [8]: url = urlparse.urlsplit('http://www.thisislink1.com/this/is/sublink1/1')
In [9]: url.netloc
Out[9]: 'www.thisislink1.com'
In Python 3, the module is urllib.parse, so it would be:
import urllib.parse as parse
url = parse.urlsplit('http://www.thisislink1.com/this/is/sublink1/1')
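Since netloc alone drops the "http://" part, the cleaned link from the question can be rebuilt from the scheme and netloc. A minimal Python 3 sketch, where base_url is just an illustrative name and not part of the standard library:

```python
from urllib.parse import urlsplit


def base_url(url):
    """Return 'scheme://netloc' for the given URL (hypothetical helper)."""
    parts = urlsplit(url)
    return f'{parts.scheme}://{parts.netloc}'


print(base_url('http://www.thisislink1.co.in/this/is/sublink1c/mylink'))
```

Note that netloc also keeps any port or user info present in the URL, so this assumes plain host names like the ones in the question.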
Upvotes: 7
Reputation: 64
Maybe use something like this:
# subject holds the URL text; the trailing .* also removes the rest of the path
result = re.sub(r"(?m)(http://(www)?\..*?)/.*", r"\1", subject)
Upvotes: 0
Reputation: 142156
Why use regex?
>>> import urlparse
>>> url = 'http://www.thisislinkd.co/this/is/sublink1/hotlinks/2'
>>> urlparse.urlsplit(url)
SplitResult(scheme='http', netloc='www.thisislinkd.co', path='/this/is/sublink1/hotlinks/2', query='', fragment='')
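The SplitResult is a named tuple, so one way to get the cleaned link is to blank out the path, query and fragment and glue the pieces back together. A rough sketch in Python 3 syntax (in Python 2 the same functions live in the urlparse module); the name drop_path is an assumption for illustration only:

```python
from urllib.parse import urlsplit, urlunsplit


def drop_path(url):
    """Keep only scheme and netloc, discarding path, query and fragment."""
    parts = urlsplit(url)
    # _replace returns a new SplitResult with the given fields overwritten
    trimmed = parts._replace(path='', query='', fragment='')
    return urlunsplit(trimmed)


print(drop_path('http://www.thisislinkd.co/this/is/sublink1/hotlinks/2'))
```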
Upvotes: 6