hans glick

Reputation: 2591

Python regular expression: match one of the listed substrings

Let's say I want to retrieve web site addresses that end in .com or .fr but not in .edu. Here is my attempt, which obviously does not work:

import re
text="www.cool.fr www.ham.edu www.stanford.com www.hack.ru"
re.findall(ur"\S+\.[com|fr]",text)

I guess there is something about regular expressions that I don't know and that would solve this problem elegantly. Thanks in advance.

Upvotes: 2

Views: 134

Answers (1)

Wiktor Stribiżew

Reputation: 626794

Your regex uses a character class [...]. Inside a character class, | matches a literal | symbol; it is not an alternation operator. The [com|fr] class matches a single character: c, o, m, |, f or r.
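To see this concretely, here is a small sketch that runs the pattern from the question against the same text. It uses a plain r"" prefix, since the ur"" prefix is only accepted by Python 2. Each match stops one character after the dot, and www.hack.ru is picked up as well because r is in the class:

```python
import re

text = "www.cool.fr www.ham.edu www.stanford.com www.hack.ru"

# [com|fr] is a character class: it consumes exactly ONE character
# out of c, o, m, |, f, r after the literal dot.
char_class_matches = re.findall(r"\S+\.[com|fr]", text)
print(char_class_matches)
# => ['www.cool.f', 'www.stanford.c', 'www.hack.r']

# The pipe is just another member of the class.
print(bool(re.fullmatch(r"[com|fr]", "|")))
# => True
```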

You need to use a group and make sure there is a word boundary after the com or fr:

import re
text="www.cool.fr www.ham.edu www.stanford.com www.hack.ru"
print(re.findall(r"\S+\.(?:com|fr)\b",text))
# => ['www.cool.fr', 'www.stanford.com']

See the IDEONE demo

The regex matches:

  • \S+\. - 1 or more non-whitespace symbols followed by a literal .
  • (?:com|fr) - a non-capturing group matching 2 alternatives: either com or fr, followed by...
  • \b - a word boundary.
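The effect of the word boundary can be checked with a short comparison. The sample text below is made up for illustration (www.example.community is not from the question); it contains a host whose last label merely starts with com:

```python
import re

# Hypothetical sample: ".community" begins with "com" but is a different suffix.
sample = "www.cool.fr www.example.community www.stanford.com"

# Without \b, the group happily matches the "com" prefix of "community".
without_boundary = re.findall(r"\S+\.(?:com|fr)", sample)
print(without_boundary)
# => ['www.cool.fr', 'www.example.com', 'www.stanford.com']

# With \b, "com" must be followed by a non-word character or the end of the string.
with_boundary = re.findall(r"\S+\.(?:com|fr)\b", sample)
print(with_boundary)
# => ['www.cool.fr', 'www.stanford.com']
```

Note that \b only requires a non-word character (or the end of the string) after com or fr, so a token such as www.stanford.com/page would still yield www.stanford.com. If the suffix must end the whole whitespace-delimited token, a lookahead like (?!\S) in place of \b is one possible alternative.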

Upvotes: 4
