Reputation: 337
I want to extract UUIDs from URLs.
For example:
/posts/eb8c6d25-8784-4cdf-b016-4d8f6df64a62?mc_cid=37387dcb5f&mc_eid=787bbeceb2
/posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034
/posts/5ff0021c-16cd-4f66-8881-ee28197ed1cf
I have thousands of strings like these.
My current regex is ".*\/posts\/(.*)[/?]+.*"
which gives me results like this:
d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid
84ba0472-926d-4f50-b3c6-46376b2fe9de/uid
6f3c97c1-b877-40e0-9479-6bdb826b7b8f/uid
f5e5dc6a-f42b-47d1-8ab1-6ae533415d24
f5e5dc6a-f42b-47d1-8ab1-6ae533415d24
f7842dce-73a3-4984-bbb0-21d7ebce1749
fdc6c48f-b124-447d-b4fc-bb528abb8e24
As you can see, my regex can't get rid of /uid, but it handles the ?xxxx query parameter fine.
What did I miss? How can I fix it?
Thanks
Upvotes: 2
Views: 6256
Reputation: 473873
The .* pattern is too broad and greedy for a UUID:
>>> import re
>>> data = """
... /posts/eb8c6d25-8784-4cdf-b016-4d8f6df64a62?mc_cid=37387dcb5f&mc_eid=787bbeceb2
... /posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034
... /posts/5ff0021c-16cd-4f66-8881-ee28197ed1cf
... """
>>>
>>> re.findall(r"/posts/([A-Za-z0-9\-]+)", data)
['eb8c6d25-8784-4cdf-b016-4d8f6df64a62',
'd78fa5da-4cbb-43b5-9fae-2b5c86f883cb',
'5ff0021c-16cd-4f66-8881-ee28197ed1cf']
Alternatively, you can be stricter about the UUID format.
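A stricter variant might look like the sketch below. It assumes the UUIDs always use the canonical 8-4-4-4-12 hexadecimal layout seen in the sample URLs; the names UUID_RE and extract_uuid, and the extra "not-a-uuid" test string, are made up for illustration.

```python
import re

# Assumption: UUIDs are five hyphen-separated hex groups of 8-4-4-4-12 characters.
UUID_RE = re.compile(
    r"/posts/([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})"
)

urls = [
    "/posts/eb8c6d25-8784-4cdf-b016-4d8f6df64a62?mc_cid=37387dcb5f&mc_eid=787bbeceb2",
    "/posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034",
    "/posts/5ff0021c-16cd-4f66-8881-ee28197ed1cf",
    "/posts/not-a-uuid",  # hypothetical malformed input, not from the question
]


def extract_uuid(url):
    """Return the UUID that follows /posts/, or None if there is no match."""
    match = UUID_RE.search(url)
    return match.group(1) if match else None


uuids = [extract_uuid(u) for u in urls]
print(uuids)
```

Unlike the looser character class, this pattern rejects path segments that merely look similar to a UUID, at the cost of a longer expression.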
Upvotes: 3
Reputation: 33335
By default, regular expression quantifiers such as * and + are greedy: they try to match as many characters as possible (informally called "maximal munch").
A plain-English description of your regex .*\/posts\/(.*)[/?]+.* would be something like:
Match anything, followed by /posts/, followed by anything, followed by one or more / or ? characters, followed by anything.
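The same description can be written next to the pattern itself using Python's re.VERBOSE flag, which ignores whitespace and allows comments inside the pattern. This is only an annotated restatement of the regex from the question; the names ORIGINAL and query_url are made up for illustration.

```python
import re

# The regex from the question, one piece per line.
ORIGINAL = re.compile(
    r"""
    .*          # anything (greedy)
    /posts/     # the literal text /posts/
    (.*)        # anything (greedy), captured as group 1
    [/?]+       # one or more '/' or '?' characters
    .*          # anything (greedy)
    """,
    re.VERBOSE,
)

# With a query string there is only one '?' and no later '/', so the
# capture group has to stop right before the '?'.
query_url = "/posts/eb8c6d25-8784-4cdf-b016-4d8f6df64a62?mc_cid=37387dcb5f&mc_eid=787bbeceb2"
captured = ORIGINAL.match(query_url).group(1)
print(captured)
```

This is why the query-parameter case appeared to work: there is only one possible place for the [/?]+ part to match.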
When we apply that regex to this text:
.../posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034
... the maximal munch rule demands that the second "anything" match be as long as possible, therefore it ends up matching more than you wanted:
d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid
... because there is still the /7034 part remaining, which matches the remainder of the regex.
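The behavior described above can be reproduced in a few lines. The sketch also shows a non-greedy (.*?) variant for comparison; that variant is an illustration, not something proposed in the question, and it still requires a / or ? after the UUID.

```python
import re

url = "/posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034"

# Greedy capture: (.*) takes as much as it can and only gives back
# enough for [/?]+ to match the last '/' in the string.
greedy = re.match(r".*/posts/(.*)[/?]+.*", url).group(1)

# Non-greedy capture: (.*?) stops at the first '/' or '?' it can.
lazy = re.match(r".*/posts/(.*?)[/?]+.*", url).group(1)

# Caveat: both patterns need at least one '/' or '?' after the UUID,
# so a bare URL does not match at all.
bare = re.match(r".*/posts/(.*?)[/?]+.*", "/posts/5ff0021c-16cd-4f66-8881-ee28197ed1cf")

print(greedy)
print(lazy)
print(bare)
```

Making the quantifier non-greedy fixes the /uid case but not the bare-URL case, which is one reason restricting the allowed characters tends to be the more robust choice.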
The best way to fix it is to use a regex that only matches characters that can actually occur in a UUID (as suggested by @alecxe).
Upvotes: 2