Reputation: 3179
I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that contain a '-' (minus sign), but not at the beginning or the end of the word.
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
Upvotes: 2
Views: 966
Reputation: 1862
You can also use the following regex:
>>> import re
>>> st = "word semi-column peace"
>>> print(re.findall(r"\S+-\S+", st))
['semi-column']
Upvotes: 1
Reputation: 6511
a '-' (minus sign) in it but not at the beginning and not at the end of the word
Since "-" is not a word character, you can't use word boundaries (\b) to exclude words with hyphens at the beginning or end. A string like "-not-wanted-" will still produce a match with both \b\w+-\w+\b and \w+-\w+.
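A quick check of that claim, as a minimal sketch (the variable names are illustrative and not part of the original answer):

```python
import re

text = "-not-wanted-"

# \b also matches between "-" and a letter, so neither pattern
# rejects a word that starts or ends with a hyphen.
with_boundaries = re.findall(r"\b\w+-\w+\b", text)
without_boundaries = re.findall(r"\w+-\w+", text)

print(with_boundaries)     # ['not-wanted']
print(without_boundaries)  # ['not-wanted']
```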
We need to add an extra condition before and after the word:
(?<![-\w])  not preceded by a hyphen or a word character.
(?![-\w])  not followed by a hyphen or a word character.
Also, a word may have more than one hyphen in it, and we need to allow that. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+
matches:
\w+  one or more word characters.
(?:-\w+)+  a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = re.findall(pattern, text)
print(result)
# ['semi-column', 'one-word', 'multi-hyphenated-word']
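For readability, the same pattern could also be written with re.VERBOSE so that each piece carries its explanation. This is only a sketch; the names verbose_pattern, sample and matches are made up for the example:

```python
import re

verbose_pattern = re.compile(r"""
    (?<![-\w])   # not preceded by a hyphen or a word character
    \w+          # one or more word characters
    (?:-\w+)+    # a hyphen plus word characters, repeated one or more times
    (?![-\w])    # not followed by a hyphen or a word character
""", re.VERBOSE)

sample = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
matches = verbose_pattern.findall(sample)
print(matches)  # ['semi-column', 'one-word', 'multi-hyphenated-word']
```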
Upvotes: 3
Reputation: 3682
You can try something like this. Centering on the hyphen, I match in both directions until I reach a space. I also check whether the word is surrounded by hyphens (e.g. -test-cats-) and, if it is, I make sure the surrounding hyphens are not included in the match. The regular expression should also work with findall.
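import re
st = 'word semi-column peace'
# Inside a character class, "|" and a non-leading "^" are literal characters,
# so [^ -] is enough to exclude spaces and hyphens.
m = re.search(r'([^ -]+-[^ -]+)', st)
if m:
    print(m.group(1))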
Upvotes: 0
Reputation: 5668
str is a built-in name, so it is better not to use it as a variable name.
import re
st = 'word semi-column peace'
# \w+ matches the word before the hyphen, - the hyphen itself, \w+ the word after it
print(re.findall(r"\b\w+-\w+\b", st))
['semi-column']
Upvotes: 4
Reputation: 26600
What you actually want is a regex like this:
\w+-\w+
This means: match one or more word characters (letters, digits, or underscore), as indicated by the '+', then a '-', followed by one or more word characters again.
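As a small illustration of that reading of the pattern (the name simple and the second sample string are made up for this sketch): it finds the hyphenated word in the question's string, but because it allows only one hyphen, it captures just the first two parts of a longer compound.

```python
import re

simple = re.compile(r"\w+-\w+")

print(simple.findall("word semi-column peace"))  # ['semi-column']
print(simple.findall("multi-hyphenated-word"))   # ['multi-hyphenated']
```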
Upvotes: 4