Rakesh Adhikesavan

Reputation: 12826

Extracting part of a URL using RegEx

I am trying to use RegEx to extract a particular part of some URLs that come in different variations. Here is the generic format:

http://www.blackpages.com/cityName-StateName/mip/part-I-want-to-extract/randomCharacters

Sometimes the "mip" segment is absent and the URL looks like this:

http://www.blackpages.com/cityName-StateName/part-I-want-to-extract/randomCharacters

I started writing the following RE:

re.compile(r"blackpages\.com/.*")

The .* matches any sequence of characters. How do I stop when I encounter a "/" and extract everything that follows, up to the next "/"? That would give me the part I want to extract.

Upvotes: 1

Views: 95

Answers (1)

Wiktor Stribiżew

Reputation: 626845

You need to use a negated character class:

re.compile(r"blackpages\.com/([^/]*)")
                              ^^^^^

The [^/]* matches zero or more characters other than /, as many as possible (greedily).

If you expect at least one character after the /, use the + quantifier (one or more occurrences) instead of *.
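To illustrate the difference between the two quantifiers, here is a small sketch. The double-slash URL is a made-up input for demonstration, not one from the question:

```python
import re

# Hypothetical URL with an empty first path segment (two consecutive slashes).
url = "http://www.blackpages.com//part-I-want-to-extract"

star = re.search(r"blackpages\.com/([^/]*)", url)
plus = re.search(r"blackpages\.com/([^/]+)", url)

print(repr(star.group(1)))  # '' (the * version accepts an empty segment)
print(plus)                 # None (the + version needs at least one character)
```

With *, the match succeeds but the captured group is empty. With +, there is no match at all, which is usually easier to detect.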

Python code:

import re
rx = r"blackpages\.com/([^/]*)"
ss = ["http://www.blackpages.com/cityName-StateName/mip/part-I-want-to-extract/randomCharacters",
"http://www.blackpages.com/cityName-StateName/part-I-want-to-extract/randomCharacters"]
for s in ss:
    m = re.search(rx, s)
    if m:
        print(m.group(1))

Output:

cityName-StateName
cityName-StateName
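Note that this prints the first path segment, not the "part-I-want-to-extract" segment named in the question. One possible extension, sketched under the assumption that the wanted part always comes right after the city/state segment and an optional literal "mip/" segment, is to skip the first segment with another negated class and make "mip/" an optional non-capturing group:

```python
import re

# Assumption: the target follows the first path segment and an optional "mip/".
rx = re.compile(r"blackpages\.com/[^/]+/(?:mip/)?([^/]+)")

urls = [
    "http://www.blackpages.com/cityName-StateName/mip/part-I-want-to-extract/randomCharacters",
    "http://www.blackpages.com/cityName-StateName/part-I-want-to-extract/randomCharacters",
]

parts = []
for u in urls:
    m = rx.search(u)
    if m:
        parts.append(m.group(1))

print(parts)  # ['part-I-want-to-extract', 'part-I-want-to-extract']
```

One caveat: if the wanted segment could itself be the literal text "mip", the optional group would consume it and the following segment would be captured instead.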

Upvotes: 1
