Reputation: 27
I have a URL stored as a str. It looks like this:
url = 'http://www.dog.com/bone?junk=8dj37hf7'
I want to delete everything from the '?' onward, so I would have:
url = 'http://www.dog.com/bone'
This is what I've tried:
import re
re.sub('?junk=*', '', url)
But I get this error:
    raise error, v # invalid expression
sre_constants.error: nothing to repeat
This is the solution:
import re
# re.sub returns a new string, so assign the result back to url
url = re.sub(r'\?junk=.*', '', url)
Edited to insert code bracketing. Edited to add .* notation per Morten Jensen, but the error persists.
Edit: Solved with '.*' and '\' escape. Thanks to Morten Jensen, jwodder, thefourtheye, et al.
Upvotes: 0
Views: 1620
Reputation: 1702
If you want to parse a URL, urlparse is a better method.
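# Python 2 import; on Python 3 use: from urllib.parse import urlparse
from urlparse import urlparse
url = 'http://www.dog.com/bone?junk=8dj37hf7'
parsed = urlparse(url)
# Reuse the parsed scheme and netloc so https URLs and ports are preserved
real_url = "{0}://{1}{2}".format(parsed.scheme, parsed.netloc, parsed.path)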
Output:
'http://www.dog.com/bone'
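As a rough sketch of the same idea, the pieces can also be reassembled with urlunsplit instead of string formatting. The helper name strip_query is made up for illustration, and the try/except import is only there so the snippet runs on both Python 2 and Python 3:

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit      # Python 2


def strip_query(url):
    """Return url without its query string and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep scheme, netloc (host plus any port) and path;
    # blank out the query and fragment components.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))


print(strip_query('http://www.dog.com/bone?junk=8dj37hf7'))
print(strip_query('https://www.dog.com:8080/bone?junk=8dj37hf7'))
```

This also leaves a URL without any query string unchanged.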
Upvotes: 1
Reputation: 4675
You can try what thefourtheye said, or alternatively this:
>>> url = 'http://www.dog.com/bone?junk=8dj37hf7'
>>> newurl = url[:url.find('?')]
>>> print newurl
http://www.dog.com/bone
This method is also faster. Here are the timings:
In [2]: url = 'http://www.dog.com/bone?junk=8dj37hf7'
In [3]: %timeit newurl = url[:url.find('?')]
1000000 loops, best of 3: 423 ns per loop
In [4]: import re
In [5]: %timeit x = re.sub('\?.*', '', url)
100000 loops, best of 3: 3.1 us per loop
In [6]: %timeit x = re.sub('\?.*', '', url)
100000 loops, best of 3: 3.25 us per loop
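One caveat with the find-based slice: str.find returns -1 when the URL contains no '?', and slicing with -1 drops the last character. A minimal guarded sketch, where strip_from_question_mark is a hypothetical helper name used only for illustration:

```python
def strip_from_question_mark(url):
    """Hypothetical helper: drop '?' and everything after it, if present."""
    idx = url.find('?')
    # str.find returns -1 when there is no '?'; slicing with -1 would
    # silently drop the last character, so return the URL unchanged instead.
    return url if idx == -1 else url[:idx]


plain = 'http://www.dog.com/bone'
print(plain[:plain.find('?')])          # unguarded slice loses the final 'e'
print(strip_from_question_mark(plain))  # guarded version keeps the URL intact
print(strip_from_question_mark('http://www.dog.com/bone?junk=8dj37hf7'))
```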
Upvotes: 1
Reputation: 114028
Why not just:
url = url.split("?",1)[0]
A regex seems like trying to kill a fly with a sledgehammer here.
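For what it's worth, here is a small sketch of how the split approach behaves on a few made-up inputs (the sample URLs with no '?' and with two '?' are illustrative, not from the question):

```python
samples = [
    'http://www.dog.com/bone?junk=8dj37hf7',   # one '?'
    'http://www.dog.com/bone',                 # no '?'
    'http://www.dog.com/bone?a=1?b=2',         # more than one '?'
]

# maxsplit=1 stops after the first '?'; element 0 is the text before it,
# or the whole string when there is no '?' at all.
stripped = [s.split('?', 1)[0] for s in samples]

for s in stripped:
    print(s)
```

All three samples come out as the same base URL, so no special case is needed for a missing '?'.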
Upvotes: 5
Reputation: 57590
The error occurs because '?' in a regex makes the immediately preceding item optional, and there is no preceding item here; to avoid this behavior, you need to escape the '?' with a backslash. Similarly, '=*' will match zero or more '=' characters, not one '=' followed by anything, which would be '=.*'. Thus, to get what you want, you need to use:
re.sub(r'\?junk=.*', '', url)
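To make the difference concrete, here is a minimal sketch comparing the three patterns on the URL from the question. The variable names are illustrative only, and the exact wording of the error message may vary between Python versions:

```python
import re

url = 'http://www.dog.com/bone?junk=8dj37hf7'

# An unescaped leading '?' has nothing to apply to, so compilation fails.
try:
    re.compile('?junk=*')
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print('pattern rejected: {0}'.format(exc))

# '=*' means "zero or more '='", so this removes only '?junk=' and leaves the value.
partial = re.sub(r'\?junk=*', '', url)

# '=.*' means "one '=' followed by any characters", so the whole query is removed.
full = re.sub(r'\?junk=.*', '', url)

print(partial)
print(full)
```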
Upvotes: 1
Reputation: 239563
Quoting from http://docs.python.org/2/library/re.html#regular-expression-syntax
'?'
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
So, you need to escape the '?' with a backslash:
url = 'http://www.dog.com/bone?junk=8dj37hf7'
import re
print re.sub(r'\?.*', '', url)
Output
http://www.dog.com/bone
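A short sketch of the quoted rule, using the 'ab?' example from the docs (the trailing '$' is added here only so that longer strings are rejected). It also shows re.escape, which adds the backslash for you when a piece of text should be matched literally:

```python
import re

# 'b?' makes the 'b' optional, so the pattern accepts both 'a' and 'ab'.
optional_b = re.compile(r'ab?$')
matches = [bool(optional_b.match(s)) for s in ('a', 'ab', 'abb')]
print(matches)

# re.escape returns the text with regex metacharacters backslash-escaped.
literal_question_mark = re.escape('?')
print(literal_question_mark)

cleaned = re.sub(literal_question_mark + r'.*', '',
                 'http://www.dog.com/bone?junk=8dj37hf7')
print(cleaned)
```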
Upvotes: 1