Reputation: 12826
I am trying to use RegEx to extract a particular part of some URLs that come in different variations. Here is the generic format:
http://www.blackpages.com/cityName-StateName/mip/part-I-want-to-extract/randomCharacters
Sometimes the "mip" part doesn't exist and the URL looks like this:
http://www.blackpages.com/cityName-StateName/part-I-want-to-extract/randomCharacters
I started writing the following RE:
re.compile("blackpages\.com/.*")
The .* matches any character. Now, how do I stop when I encounter a "/" and extract everything that follows until the next "/" is encountered? This would give me the part I want to extract.
Upvotes: 1
Views: 95
Reputation: 626845
You need to use a negated character class:
re.compile(r"blackpages\.com/([^/]*)")
                              ^^^^
The [^/]* will match 0+ chars other than /, as many as possible (greedily).
If you expect at least one char after /, use the + quantifier (1 or more occurrences) instead of *.
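To make the difference between * and + concrete, here is a small sketch. The URL with two consecutive slashes is a made-up input (it does not come from the question) chosen only to show what happens when the segment after the slash is empty.

```python
import re

star = re.compile(r"blackpages\.com/([^/]*)")  # 0 or more non-slash chars
plus = re.compile(r"blackpages\.com/([^/]+)")  # 1 or more non-slash chars

# Hypothetical URL with an empty first path segment (two slashes in a row)
url = "http://www.blackpages.com//randomCharacters"

star_match = star.search(url)
plus_match = plus.search(url)

# With *, the pattern matches and the group is an empty string
print(repr(star_match.group(1)))
# With +, there is no match at all, so search() returns None
print(plus_match)
```

With *, you get a match whose group is empty; with +, you get no match, which is often easier to handle with a plain `if m:` check.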
See the regex demo
import re

rx = r"blackpages\.com/([^/]*)"
ss = ["http://www.blackpages.com/cityName-StateName/mip/part-I-want-to-extract/randomCharacters",
      "http://www.blackpages.com/cityName-StateName/part-I-want-to-extract/randomCharacters"]
for s in ss:
    m = re.search(rx, s)
    if m:
        # Group 1 holds the first path segment after the domain
        print(m.group(1))
Output:
cityName-StateName
cityName-StateName
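Note that the pattern above captures the first path segment (the city and state). If the goal is the segment named in the question, the same negated character class can be repeated. The following is only a sketch, assuming the city segment is always present and that "mip" is the only optional segment that can precede the wanted part; the helper name extract_segment is made up for illustration.

```python
import re

# Skip the first segment (city-state), then an optional "mip/" segment,
# and capture the segment that follows.
rx = re.compile(r"blackpages\.com/[^/]+/(?:mip/)?([^/]+)")

urls = [
    "http://www.blackpages.com/cityName-StateName/mip/part-I-want-to-extract/randomCharacters",
    "http://www.blackpages.com/cityName-StateName/part-I-want-to-extract/randomCharacters",
]

def extract_segment(url):
    """Return the captured segment, or None if the URL does not match."""
    m = rx.search(url)
    return m.group(1) if m else None

for u in urls:
    print(extract_segment(u))
```

Both URLs from the question should yield part-I-want-to-extract. The (?:mip/)? group is non-capturing and optional, so the wanted segment stays in group 1 in either variation.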
Upvotes: 1