AJW

Reputation: 5863

python regex urls

I have a bunch of (ugly, if I may say so) URLs that I would like to clean up using Python regex. My URLs look something like:

http://www.thisislink1.com/this/is/sublink1/1
http://www.thisislink2.co.uk/this/is/sublink1s/klinks
http://www.thisislinkd.co/this/is/sublink1/hotlinks/2
http://www.thisislinkf.com.uk/this/is/sublink1d/morelink
http://www.thisislink1.co.in/this/is/sublink1c/mylink
....

I'd like to clean up these URLs so that the final links look like:

http://www.thisislink1.com
http://www.thisislink2.co.uk
http://www.thisislinkd.co
http://www.thisislinkf.com.uk
http://www.thisislink1.co.in
....

I was wondering how I can achieve this in a Pythonic way. Sorry if this is a 101 question; I am new to Python regex.

Upvotes: 0

Views: 286

Answers (4)

Chris Seymour

Reputation: 85795

You should use a URL parser, as others have suggested, but for completeness here is a solution with regex:

>>> import re
>>> url = 'http://www.thisislink1.com/this/is/sublink1/1'
>>> re.sub(r'(?<![/:])/.*', '', url)
'http://www.thisislink1.com'

Explanation:

Match everything from the first forward slash that is not preceded by a : or / through to the end of the string, and replace it with the empty string ''.

(?<![/:]) # Negative lookbehind for '/' or ':'
/.*       # Match a / followed by anything
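
To run the same substitution over the whole list from the question, something like the following should work. This is only a sketch: the names `PATH_RE` and `strip_path` are made up for illustration, and it assumes every URL has the `scheme://host/...` shape shown in the question.

```python
import re

# Same pattern as in the answer, compiled once: the first "/" that is
# not preceded by ":" or "/", plus everything after it.
PATH_RE = re.compile(r'(?<![/:])/.*')

def strip_path(url):
    """Return the URL with its path (and anything after it) removed."""
    return PATH_RE.sub('', url)

urls = [
    'http://www.thisislink1.com/this/is/sublink1/1',
    'http://www.thisislink2.co.uk/this/is/sublink1s/klinks',
    'http://www.thisislinkd.co/this/is/sublink1/hotlinks/2',
]

cleaned = [strip_path(u) for u in urls]
for u in cleaned:
    print(u)
```

A URL that has no path at all contains no matching slash, so it comes back unchanged.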

Upvotes: 1

unutbu

Reputation: 879591

Use urlparse.urlsplit:

In [3]: import urlparse    

In [8]: url = urlparse.urlsplit('http://www.thisislink1.com/this/is/sublink1/1')

In [9]: url.netloc
Out[9]: 'www.thisislink1.com'

In Python 3, it would be:

import urllib.parse as parse
url = parse.urlsplit('http://www.thisislink1.com/this/is/sublink1/1')

Upvotes: 7

Andreas

Reputation: 64

Maybe use something like this:

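import re

# The trailing .* consumes the rest of the path; without it only the first "/" after the host is removed.
result = re.sub(r"(?m)(http://(www)?\..*?)/.*", r"\1", subject)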

Upvotes: 0

Jon Clements

Reputation: 142156

Why use regex?

>>> import urlparse
>>> url = 'http://www.thisislinkd.co/this/is/sublink1/hotlinks/2'
>>> urlparse.urlsplit(url)
SplitResult(scheme='http', netloc='www.thisislinkd.co', path='/this/is/sublink1/hotlinks/2', query='', fragment='')
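
Since the result is a named tuple, one possible way to get the cleaned link from it is to blank out the unwanted fields and reassemble the URL. A sketch using the Python 3 import (`urllib.parse` in place of `urlparse`); the variable name `root` is just for illustration:

```python
from urllib.parse import urlsplit

url = 'http://www.thisislinkd.co/this/is/sublink1/hotlinks/2'
parts = urlsplit(url)

# SplitResult is a named tuple, so _replace returns a copy with the
# given fields changed; geturl() then joins the pieces back together.
root = parts._replace(path='', query='', fragment='').geturl()
print(root)
```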

Upvotes: 6
