Reputation: 563

regex to remove every hyphen except between two words

I am cleaning some text and would like to remove all hyphens and special characters, except for hyphens between two words, such as tic-tacs or popcorn-flavoured.

I wrote the regex below, but it removes every hyphen.

import re

text = 'popcorn-flavoured---'
new_text = re.sub(r'[^a-zA-Z0-9]+', '', text)  # also strips the hyphen inside the word
print(new_text)  # popcornflavoured

I would like the output to be:

popcorn-flavoured

Upvotes: 0

Answers (4)

The fourth bird

Reputation: 163287

As an alternative, you could capture a hyphen between word characters and keep that group in the replacement. Using an alternation, you could match the hyphens that you want to remove.

(\w+-\w+)|-+

Explanation

(\w+-\w+) Capture group 1, match 1+ word chars, hyphen and 1+ word chars
| Or
-+ Match 1+ times a hyphen

Regex demo | Python demo

Example code

import re

regex = r"(\w+-\w+)|-+"
test_str = ("popcorn-flavoured---\n"
    "tic-tacs")

result = re.sub(regex, r"\1", test_str)
print(result)

Output

popcorn-flavoured
tic-tacs

Upvotes: 4

Cary Swoveland

Reputation: 110675

You can replace matches of the regular expression

-(?!\w)|(?<!\w)-

with empty strings.

Regex demo | Python demo

The regex will match hyphens that are not both preceded and followed by a word character.

Python's regex engine performs the following operations.

-        match '-'
(?!\w)   the following character is not a word character
|        or
(?<!\w)  the previous character is not a word character
-        match '-'

(?!\w) is a negative lookahead; (?<!\w) is a negative lookbehind.
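
To make this concrete, here is a minimal sketch of the substitution in Python. The helper name strip_stray_hyphens and the sample strings are made up for illustration; only the pattern comes from this answer. Keep in mind that \w also matches digits and underscores, so a hyphen in something like 3-4 is kept as well.

```python
import re

# A hyphen that is not followed by a word character,
# or a hyphen that is not preceded by one.
STRAY_HYPHEN = re.compile(r"-(?!\w)|(?<!\w)-")

def strip_stray_hyphens(text):
    """Remove every hyphen that is not sandwiched between two word characters."""
    return STRAY_HYPHEN.sub("", text)

# Hypothetical sample inputs.
for sample in ["popcorn-flavoured---", "tic-tacs", "--a-b-c--"]:
    print(repr(sample), "->", repr(strip_stray_hyphens(sample)))
```

Because the lookarounds consume no characters, chains such as a-b-c keep both inner hyphens.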

Upvotes: 5

DarrylG

Reputation: 17156

You can use

import re

p = re.compile(r"(\b[-]\b)|[-]")
result = p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)

Test

With:

text='popcorn-flavoured---'

Output (result):

popcorn-flavoured

Explanation

This pattern detects hyphens between two words:

(\b[-]\b)

This pattern detects all hyphens

[-]

Regex substitution

p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)

When a hyphen is detected between two words, m.group(1) is non-empty, so we keep the hyphen as it is.

else "")

This branch applies when the match came from [-] alone; the hyphen is then replaced with "", which removes it.

Upvotes: 0

revliscano

Reputation: 2272

You can use findall() to get the part that matches your criteria.

new_text = re.findall(r'[\w]+[-]?[\w]+', text)[0]

Play around with it with other inputs.
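
As a starting point for that, here is a small sketch that runs the pattern over a few made-up inputs. The helper name hyphenated_words is hypothetical. Two things worth noticing: findall() returns every match, not just the first, and the pattern needs at least two word characters, so a single-letter word is skipped.

```python
import re

# Same pattern as above, written as a raw string.
WORD = re.compile(r'[\w]+[-]?[\w]+')

def hyphenated_words(text):
    """Return all runs of word characters, each optionally containing one inner hyphen."""
    return WORD.findall(text)

print(hyphenated_words('popcorn-flavoured---'))      # ['popcorn-flavoured']
print(hyphenated_words('--tic-tacs and popcorn--'))  # ['tic-tacs', 'and', 'popcorn']
print(hyphenated_words('a b-c'))                     # ['b-c'] (the lone 'a' is skipped)
```

If the cleaned text is needed as one string, joining the matches with spaces is one option, at the cost of dropping the original whitespace and punctuation.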

Upvotes: 1
