Reputation: 2475
I have this URL:
http://www.exmaple.com/boo/a.php?a=jsd
and I want the output to be something like this:
http://www.exmaple.com/boo/
Likewise, if I have
http://www.exmaple.com/abc.html
it should be
http://www.exmaple.com/
and
http://www.exmaple.com/
should return
http://www.exmaple.com/
without any change
This is what I have tried:
re.sub(r'\?[\S]+','',"http://www.exmaple.com/boo/a.php?a=jsd")
but it returns
http://www.exmaple.com/boo/a.php
Any suggestions on how to get the correct output, or does anyone have a better way to do this?
Upvotes: 3
Views: 2095
Reputation: 3857
There might be a more optimized way to do it, but with this one you won't need an obscure import or third-party package.
url = "http://www.google.com/abc/abc.html?q=test"
# Drop the query string; partition() is safe even when the URL has no "?"
cleaned_url = url.partition("?")[0]
cleaned_url = cleaned_url.split("/")
# Drop any segment that looks like an .html file (other extensions such as .php are not handled)
cleaned_url = [item for item in cleaned_url if ".html" not in item]
cleaned_url = "/".join(cleaned_url)
Upvotes: 0
Reputation: 451
I would do something like this:
>>> import re
>>> url = "http://www.exmaple.com/boo/a.php?a=jsd"
>>> url[:url.rfind("/")+1]
'http://www.exmaple.com/boo/'
This removes everything after the last "/". I am not sure it covers all special cases, though.
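To make the special cases concrete, here is a small sketch of the same slicing wrapped in a helper (the name cut_after_last_slash is only illustrative, and the third URL is a made-up input). It works when the last "/" belongs to the path, but it breaks when the URL has no path at all or when the query string itself contains a "/":

```python
def cut_after_last_slash(url):
    # Same slicing as above: keep everything up to and including the last "/"
    return url[:url.rfind("/") + 1]

# Works: the last "/" is part of the path
print(cut_after_last_slash("http://www.exmaple.com/boo/a.php?a=jsd"))  # http://www.exmaple.com/boo/

# Breaks: with no path, the last "/" is the one in "http://"
print(cut_after_last_slash("http://www.exmaple.com"))  # http://

# Breaks: a "/" inside the query string is treated as a path separator
print(cut_after_last_slash("http://www.exmaple.com/a.php?next=/x/y"))  # http://www.exmaple.com/a.php?next=/x/
```

So plain string slicing is only safe when the query string has been removed first and the URL is known to have a path.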
EDIT: New solution using urlparse and my simple rfind:
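import urlparse  # Python 2 module; in Python 3 use urllib.parse instead

def url_cutter(url):
    up = urlparse.urlparse(url)
    # Rebuild the URL from scheme, netloc and path, dropping query and fragment
    url2 = up[0] + "://" + up[1] + up[2]
    # Only trim when the path contains a "/", so the "//" after the scheme is never touched
    if "/" in up[2]:
        url2 = url2[:url2.rfind("/") + 1]
    return url2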
Then:
In [36]: url_cutter("http://www.exmaple.com/boo/a.php?a=jsd")
Out[36]: 'http://www.exmaple.com/boo/'
In [37]: url_cutter("http://www.exmaple.com/boo/a.php?a=jsd#dvt_on")
Out[37]: 'http://www.exmaple.com/boo/'
In [38]: url_cutter("http://www.exmaple.com")
Out[38]: 'http://www.exmaple.com'
Upvotes: 1
Reputation: 7357
Please use the stdlib urlparse module, like this. Generally, I try to avoid regex unless it is absolutely necessary.
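>>> from urlparse import urlparse, urlunparse  # Python 3: from urllib.parse import urlparse, urlunparse
>>> parsed = urlparse("http://www.exmaple.com/boo/a.php?a=jsd")
>>> scheme, netloc, path, params, query, fragment = parsed
>>> # Keep the path up to and including its last "/", and drop params, query and fragment
>>> urlunparse((scheme, netloc, path.rsplit('/', 1)[0] + '/', '', '', ''))
'http://www.exmaple.com/boo/'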
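For readers on Python 3, where the urlparse module lives in urllib.parse, the same idea might be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name strip_to_directory is hypothetical, and it assumes an absolute URL with a scheme and host, like the ones in the question.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_to_directory(url):
    """Return url without its last path segment, query string and fragment."""
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    # Keep the path up to and including its last "/"; an empty path becomes "/"
    directory = path[:path.rfind("/") + 1] or "/"
    return urlunsplit((scheme, netloc, directory, "", ""))


# The three cases from the question
print(strip_to_directory("http://www.exmaple.com/boo/a.php?a=jsd"))  # http://www.exmaple.com/boo/
print(strip_to_directory("http://www.exmaple.com/abc.html"))         # http://www.exmaple.com/
print(strip_to_directory("http://www.exmaple.com/"))                 # http://www.exmaple.com/
```

Because the query string is split off before the path is trimmed, a "/" inside the query does not affect the result. Note that a URL with no path, such as http://www.exmaple.com, comes back with a trailing "/" added.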
Upvotes: 5