Toly

Reputation: 3179

regex - how to select a word that has a '-' in it?

I am learning Regular Expressions, so apologies for a simple question.

I want to select the words that have a '-' (minus sign) in them, but not at the beginning or the end of the word.

I tried (using findall):

r'\b-\b'

for

str = 'word semi-column peace'

but, of course got only:

['-']

Thank you!

Upvotes: 2

Views: 966

Answers (5)

Mayur Koshti

Reputation: 1862

You can also use the following regex:

>>> import re
>>> st = "word semi-column peace"
>>> print(re.findall(r"\S+-\S+", st))
['semi-column']

Upvotes: 1

Mariano

Reputation: 6511

a '-' (minus sign) in it but not at the beginning and not at the end of the word

Since "-" is not a word character, you can't use word boundaries (\b) to reject words with hyphens at the beginning or end. In a string like "-not-wanted-", both \b\w+-\w+\b and \w+-\w+ still find a match (not-wanted).
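
A quick check of that claim, as a minimal sketch (the variable names are only for illustration):

```python
import re

text = "-not-wanted-"

# \b sits between "-" and "n", and between "d" and "-",
# so the word boundaries do not stop the match.
bounded = re.findall(r"\b\w+-\w+\b", text)
plain = re.findall(r"\w+-\w+", text)

print(bounded)  # ['not-wanted']
print(plain)    # ['not-wanted']
```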


We need to add an extra condition before and after the word:

  • Before: (?<![-\w]) not preceded by a hyphen or a word character.
  • After: (?![-\w]) not followed by a hyphen or a word character.

Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:

  • \w+(?:-\w+)+ matches:
    • \w+ one or more word characters
    • (?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
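
To see what the repeated group changes, here is a small sketch comparing the single-hyphen pattern with the repeated one (the sample word is only an illustration):

```python
import re

word = "multi-hyphenated-word"

# Exactly one hyphen allowed: the match stops before the second hyphen.
single = re.findall(r"\w+-\w+", word)

# "Hyphen plus word characters" may repeat, so the whole word is matched.
# The group is non-capturing, so findall returns the full match.
repeated = re.findall(r"\w+(?:-\w+)+", word)

print(single)    # ['multi-hyphenated']
print(repeated)  # ['multi-hyphenated-word']
```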

Regex:

(?<![-\w])\w+(?:-\w+)+(?![-\w])

regex101 demo

Code:

import re

pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"

result = re.findall(pattern, text)
print(result)  # ['semi-column', 'one-word', 'multi-hyphenated-word']

ideone demo

Upvotes: 3

reticentroot

Reputation: 3682

You can try something like this. Centering on the hyphen, I match characters in either direction until there is a space or another hyphen. If a word is surrounded by hyphens (e.g. -test-cats-), the surrounding hyphens are left out of the match. The regular expression should also work with findall.

import re

st = 'word semi-column peace'
# One or more non-space, non-hyphen characters on each side of a hyphen.
m = re.search(r'([^ -]+-[^ -]+)', st)
if m:
    print(m.group(1))  # semi-column
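
As a rough illustration of the surrounding-hyphen case with findall, here is a sketch that uses a simplified character class, [^\s-] (an assumption, not the exact pattern above). Note that it returns the inner part of -test-cats- rather than skipping that word:

```python
import re

st = "word semi-column peace -test-cats-"

# [^\s-]+ is one or more characters that are neither whitespace nor a hyphen.
found = re.findall(r"[^\s-]+-[^\s-]+", st)

print(found)  # ['semi-column', 'test-cats']
```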

Upvotes: 0

LetzerWille

Reputation: 5668

str is a built-in name, so it is better not to use it as a variable name.

import re

st = 'word semi-column peace'
# \w+ matches the word characters before the hyphen; -\w+ matches the hyphen and the word characters after it
print(re.findall(r"\b\w+-\w+\b", st))

['semi-column']
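
For what it's worth, here is a small sketch of why shadowing str can bite later. The function name is made up for the demo, and the shadowing is kept inside the function so it does not leak into the rest of the program:

```python
def shadowing_demo():
    # Inside this function, the name str now refers to a string,
    # not to the built-in type.
    str = 'word semi-column peace'
    try:
        str(42)  # would normally convert 42 to '42'
    except TypeError as exc:
        return type(exc).__name__
    return "no error"

outcome = shadowing_demo()
print(outcome)  # TypeError
```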

Upvotes: 4

idjaw

Reputation: 26600

What you actually want to do is a regex like this:

\w+-\w+

This means: match one or more word characters (\w matches a letter, digit, or underscore, and the '+' means "at least once"), then a '-', followed by one or more word characters again.
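
A minimal sketch of that pattern in use. The token x_1-y2 is made up here, only to show that \w also covers digits and underscores:

```python
import re

# Sample sentence from the question plus one made-up token, x_1-y2.
sample = "word semi-column peace x_1-y2"

matches = re.findall(r"\w+-\w+", sample)
print(matches)  # ['semi-column', 'x_1-y2']
```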

Upvotes: 4
