Lucas Shen

Reputation: 337

Extract UUID from URL

I want to extract the UUID from URLs.

For example:

/posts/eb8c6d25-8784-4cdf-b016-4d8f6df64a62?mc_cid=37387dcb5f&mc_eid=787bbeceb2
/posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034
/posts/5ff0021c-16cd-4f66-8881-ee28197ed1cf

I have thousands of strings like these.

My current regex is ".*\/posts\/(.*)[/?]+.*", which gives results like this:

d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid
84ba0472-926d-4f50-b3c6-46376b2fe9de/uid
6f3c97c1-b877-40e0-9479-6bdb826b7b8f/uid
f5e5dc6a-f42b-47d1-8ab1-6ae533415d24
f5e5dc6a-f42b-47d1-8ab1-6ae533415d24
f7842dce-73a3-4984-bbb0-21d7ebce1749
fdc6c48f-b124-447d-b4fc-bb528abb8e24

As you can see, my regex fails to strip /uid, but it handles the ?xxxx query parameters fine.

What did I miss? How can I fix it?

Thanks

Upvotes: 2

Views: 6256

Answers (2)

alecxe

Reputation: 473873

The .* pattern is too broad and greedy for a UUID:

>>> import re
>>> data = """
... /posts/eb8c6d25-8784-4cdf-b016-4d8f6df64a62?mc_cid=37387dcb5f&mc_eid=787bbeceb2
... /posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034
... /posts/5ff0021c-16cd-4f66-8881-ee28197ed1cf
... """
>>> 
>>> re.findall(r"/posts/([A-Za-z0-9\-]+)", data)
['eb8c6d25-8784-4cdf-b016-4d8f6df64a62', 
 'd78fa5da-4cbb-43b5-9fae-2b5c86f883cb', 
 '5ff0021c-16cd-4f66-8881-ee28197ed1cf']

Or you can be stricter about the UUID format and match its 8-4-4-4-12 hexadecimal layout explicitly.
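A stricter pattern could look something like the sketch below. It assumes the IDs use the canonical 8-4-4-4-12 hexadecimal text form; the names UUID_RE and extract_uuid, and the fourth sample URL, are made up for illustration.

```python
import re

# Strict pattern: 8-4-4-4-12 hexadecimal digits right after /posts/.
UUID_RE = re.compile(
    r"/posts/("
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    r")"
)

urls = [
    "/posts/eb8c6d25-8784-4cdf-b016-4d8f6df64a62?mc_cid=37387dcb5f&mc_eid=787bbeceb2",
    "/posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034",
    "/posts/5ff0021c-16cd-4f66-8881-ee28197ed1cf",
    "/posts/not-a-uuid/uid/1",  # hypothetical malformed input
]


def extract_uuid(url):
    """Return the UUID that follows /posts/, or None if there is none."""
    m = UUID_RE.search(url)
    return m.group(1) if m else None


strict_results = [extract_uuid(u) for u in urls]
print(strict_results)
```

Unlike the looser character class, this version returns None for the malformed URL instead of capturing a partial string.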

Upvotes: 3

John Gordon

Reputation: 33335

By default, regex quantifiers such as * and + are greedy: they match as many characters as possible while still allowing the overall pattern to match (informally called "maximal munch").

A plain-English description of your regex .*\/posts\/(.*)[/?]+.* would be something like:

Match anything, followed by /posts/, followed by anything, followed by one or more / or ? characters, followed by anything.

When we apply that regex to this text:

.../posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034

... the maximal munch rule demands that the second "anything" match be as long as possible, so it ends up matching more than you wanted:

d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid

... because there is still the /7034 part remaining, which matches the remainder of the regex.
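The behaviour described above can be reproduced directly. This is a small sketch using the regex and sample URLs from the question; the variable names are only illustrative.

```python
import re

# The regex from the question, unchanged.
pattern = re.compile(r".*\/posts\/(.*)[/?]+.*")

# With a trailing /uid/7034, the greedy (.*) backs off only as far as the
# LAST "/" in the string, so "/uid" stays inside the capture group.
m = pattern.match("/posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034")
greedy_capture = m.group(1)

# With a query string there is only one "?" (and no later "/"),
# so the capture happens to stop in the right place.
m = pattern.match(
    "/posts/eb8c6d25-8784-4cdf-b016-4d8f6df64a62?mc_cid=37387dcb5f&mc_eid=787bbeceb2"
)
query_capture = m.group(1)

# A bare URL has no "/" or "?" after the UUID, so [/?]+ cannot match at all.
bare_match = pattern.match("/posts/5ff0021c-16cd-4f66-8881-ee28197ed1cf")

print(greedy_capture)  # d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid
print(query_capture)   # eb8c6d25-8784-4cdf-b016-4d8f6df64a62
print(bare_match)      # None
```

Note the last case: because the pattern requires at least one / or ? after the capture group, a URL that ends with the UUID does not match at all.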

The best way to fix it is to use a regex that matches only characters that can actually occur in a UUID (as suggested by @alecxe).
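One way this might look in practice is sketched below: a character class limited to hex digits and hyphens, followed by an optional sanity check with the standard library's uuid module. The names POST_ID and extract_post_uuid are hypothetical, and the validation step is an addition for illustration, not part of either answer.

```python
import re
import uuid

# Only hex digits and hyphens can appear in the capture, so the match
# stops at the first "/" or "?" without relying on backtracking.
POST_ID = re.compile(r"/posts/([0-9a-fA-F-]+)")


def extract_post_uuid(url):
    """Return the UUID after /posts/ as a string, or None if absent or malformed."""
    m = POST_ID.search(url)
    if m is None:
        return None
    try:
        # uuid.UUID raises ValueError for strings that are not valid UUIDs.
        return str(uuid.UUID(m.group(1)))
    except ValueError:
        return None


print(extract_post_uuid("/posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034"))
print(extract_post_uuid("/posts/5ff0021c-16cd-4f66-8881-ee28197ed1cf"))
print(extract_post_uuid("/posts/abc/uid/1"))  # too short to be a UUID
```

The first two calls print the bare UUID and the third prints None, since "abc" passes the character class but fails validation.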

Upvotes: 2
