user2958776

Reputation: 27

Truncating string with re.sub

I have a URL stored as a str. It looks like this:

url = 'http://www.dog.com/bone?junk=8dj37hf7'

I want to delete everything from the '?' onward, so that I end up with:

url = 'http://www.dog.com/bone'

This is what I've tried:

import re
re.sub('?junk=*', '', url)

But I get this error:

raise error, v # invalid expression
sre_constants.error: nothing to repeat

This is the solution:

import re
# Raw string keeps the backslash intact; re.sub returns a new string, so assign it.
url = re.sub(r'\?junk=.*', '', url)

Edited to insert code bracketing. Edited to add .* notation per Morten Jensen, but the error persists.

Edit: Solved with '.*' and '\' escape. Thanks to Morten Jensen, jwodder, thefourtheye, et al.

Upvotes: 0

Views: 1620

Answers (5)

Puffin GDI

Reputation: 1702

If you want to parse a URL, urlparse is a better method.

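# Python 2 module name; in Python 3 use: from urllib.parse import urlparse
from urlparse import urlparse

url = 'http://www.dog.com/bone?junk=8dj37hf7'
parsed = urlparse(url)
# Use the parsed scheme and netloc rather than hardcoding "http://",
# so https URLs and explicit ports are preserved.
real_url = "{0}://{1}{2}".format(parsed.scheme, parsed.netloc, parsed.path)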

Output:

'http://www.dog.com/bone'
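
For readers on Python 3, where the module is urllib.parse, the same idea can be sketched as below. This is only an illustration, not part of the original answer; the helper name drop_query is hypothetical, and it assumes you also want to drop any fragment (#...).

```python
from urllib.parse import urlsplit, urlunsplit

def drop_query(u):
    """Rebuild u without its query string or fragment (hypothetical helper)."""
    parts = urlsplit(u)
    # Keep scheme, netloc (host and optional port) and path; blank out the rest.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

print(drop_query('http://www.dog.com/bone?junk=8dj37hf7'))
```

Rebuilding from the parsed components means the scheme and port come from the input URL rather than being assumed.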

Upvotes: 1

Anshu Dwibhashi

Reputation: 4675

You can try what thefourtheye said, or alternatively this:

>>> url = 'http://www.dog.com/bone?junk=8dj37hf7'
>>> newurl = url[:url.find('?')]
>>> print newurl

http://www.dog.com/bone

This method is also faster. Here is a timing comparison:

In [2]: url = 'http://www.dog.com/bone?junk=8dj37hf7'

In [3]: %timeit newurl = url[:url.find('?')]
1000000 loops, best of 3: 423 ns per loop

In [4]: import re

In [5]: %timeit x = re.sub('\?.*', '', url)
100000 loops, best of 3: 3.1 us per loop

In [6]: %timeit x = re.sub('\?.*', '', url)
100000 loops, best of 3: 3.25 us per loop
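
One caveat with the slice, sketched here assuming Python 3: str.find returns -1 when the string contains no '?', so url[:url.find('?')] silently drops the last character. str.partition does not have that problem. The helper name strip_query is hypothetical.

```python
def strip_query(u):
    """Return u up to (not including) the first '?' (hypothetical helper)."""
    # partition returns (head, sep, tail); head is the whole string
    # when the separator is absent.
    head, _sep, _tail = u.partition('?')
    return head

plain = 'http://www.dog.com/bone'
# find() returns -1 here, so the slice becomes plain[:-1].
sliced = plain[:plain.find('?')]
print(sliced)              # last character is lost
print(strip_query(plain))  # unchanged
```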

Upvotes: 1

Joran Beasley

Reputation: 114028

Why not just:

url = url.split("?",1)[0]

A regex here seems like trying to kill a fly with a sledgehammer.

Upvotes: 5

jwodder

Reputation: 57590

The error is because ? in a regex causes the immediately preceding item to become optional, and yet there is no preceding item here; to avoid this behavior, you need to escape the ? with a backslash. Similarly, =* will match zero or more =s, not one = followed by whatever, which would be =.*. Thus, to get what you want, you need to use:

re.sub(r'\?junk=.*', '', url)
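
Both points can be checked with a short sketch, assuming Python 3 (the thread uses Python 2, where the exception is reported as sre_constants.error). The helper name strip_junk is hypothetical.

```python
import re

url = 'http://www.dog.com/bone?junk=8dj37hf7'

# An unescaped leading '?' has nothing before it to make optional,
# so the pattern fails to compile.
try:
    re.sub('?junk=*', '', url)
    compile_failed = False
except re.error:
    compile_failed = True

def strip_junk(u):
    """Remove '?junk=' and everything after it (hypothetical helper)."""
    # Raw string so the backslash reaches the regex engine unchanged.
    return re.sub(r'\?junk=.*', '', u)

# '=*' matches zero or more '=' characters, so only '?junk=' is removed
# and the value after it is left behind.
partial = re.sub(r'\?junk=*', '', url)

print(compile_failed)
print(partial)
print(strip_junk(url))
```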

Upvotes: 1

thefourtheye

Reputation: 239563

Quoting from http://docs.python.org/2/library/re.html#regular-expression-syntax

'?'

Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.

So you need to escape the ? with a backslash:

import re

url = 'http://www.dog.com/bone?junk=8dj37hf7'
# Raw string so the backslash is passed to the regex engine as written.
print re.sub(r'\?.*', '', url)

Output

http://www.dog.com/bone
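
If the text to cut at is not known in advance, the escaping can be left to re.escape instead of being written by hand. A minimal sketch, assuming Python 3; the helper name truncate_at is hypothetical and not from the answer above.

```python
import re

def truncate_at(u, marker):
    """Remove the first occurrence of marker and everything after it (hypothetical helper)."""
    # re.escape backslash-escapes regex metacharacters such as '?'.
    # '.' does not match newlines by default, which is fine for a URL.
    return re.sub(re.escape(marker) + r'.*', '', u)

print(truncate_at('http://www.dog.com/bone?junk=8dj37hf7', '?'))
```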

Upvotes: 1
