Reputation: 2591
Let's say I want to retrieve website addresses that end in .com or .fr but not in .edu. Here is my attempt, which obviously does not work:
import re
text="www.cool.fr www.ham.edu www.stanford.com www.hack.ru"
re.findall(ur"\S+\.[com|fr]",text)
I guess there is something about regular expressions that I don't know and that would solve this problem elegantly. Thanks in advance.
Upvotes: 2
Views: 134
Reputation: 626794
Your regex uses a character class, [...], where | matches a literal | symbol; it is not an alternation operator. The [com|fr] class matches any one of the characters c, o, m, |, f or r.
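To see what the character class actually does, here is a small sketch that runs the pattern from the question (written as a plain Python 3 raw string, without the ur prefix) against the same text. The variable name char_class_matches is only illustrative.

```python
import re

text = "www.cool.fr www.ham.edu www.stanford.com www.hack.ru"

# [com|fr] consumes exactly ONE character from the set c, o, m, |, f, r,
# so each match stops one character after the last qualifying dot.
char_class_matches = re.findall(r"\S+\.[com|fr]", text)
print(char_class_matches)
# => ['www.cool.f', 'www.stanford.c', 'www.hack.r']

# The pipe is just another member of the set, not an "or".
print(re.fullmatch(r"[com|fr]", "|") is not None)
# => True
```

Note that www.hack.ru is matched too, because r is one of the characters in the class.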
You need to use a group and make sure there is a word boundary after the com or fr:
import re
text="www.cool.fr www.ham.edu www.stanford.com www.hack.ru"
print(re.findall(r"\S+\.(?:com|fr)\b",text))
# => ['www.cool.fr', 'www.stanford.com']
See the IDEONE demo
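As a rough illustration of why the word boundary matters, the sketch below compares the pattern with and without \b. The sample text (including www.shop.community) is made up for this example and is not from the question.

```python
import re

# Hypothetical sample: one address has a suffix that merely starts with "com",
# and one is followed by punctuation.
sample = "www.cool.fr www.shop.community www.stanford.com,"

without_boundary = re.findall(r"\S+\.(?:com|fr)", sample)
with_boundary = re.findall(r"\S+\.(?:com|fr)\b", sample)

print(without_boundary)
# => ['www.cool.fr', 'www.shop.com', 'www.stanford.com']
print(with_boundary)
# => ['www.cool.fr', 'www.stanford.com']
```

Without \b, the .com at the start of .community is accepted as a match; with it, com must be followed by a non-word character or the end of the string.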
The regex matches:
\S+\. - 1 or more non-whitespace symbols followed by a literal .
(?:com|fr) - a non-capturing group matching 2 alternatives: either com or fr, that are followed by...
\b - a word boundary.
Upvotes: 4