Vogelsen

Reputation: 23

How to remove every special character except for the hyphen and the apostrophe inside and between words?

As an example, I already managed to break the sentence "That's a- tasty tic-tac. Or -not?" into a list of words like this: words = ["That's", 'a-', 'tasty', 'tic-tac.', 'Or', '-not?'].

Now I have to remove every special character I don't need and get this: words = ["That's", 'a', 'tasty', 'tic-tac', 'Or', 'not']

My current code looks like this:

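import re
pattern = re.compile(r'[\W_]+')
word_list = []
# file_text is a list of lines, each line being a list of words
for x in range(len(file_text)):
    for y in range(len(file_text[x])):
        # removes every non-alphanumeric character, including - and '
        word_list.append(pattern.sub('', file_text[x][y]))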

I have a whole text that I first split into lines of words and then flatten into a single list of words.

Upvotes: 2

Views: 606

Answers (1)

Wiktor Stribiżew

Reputation: 626802

You can use

r"\b([-'])\b|[\W_]"

See the regex demo (the pattern there is slightly modified so that [\W_] cannot match newlines, because the input at the demo site is a single multiline string).
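
If you want to clean a whole multiline string before splitting it, something along these lines may work. It is only a sketch: the [^\w\s]|_ part is an assumed way of keeping whitespace (and so newlines) out of the match, not necessarily what the demo uses, and rx_text and text are illustrative names.

```python
import re

# Assumed variant: [^\w\s] leaves whitespace untouched so the cleaned
# text can still be split into words afterwards.
rx_text = re.compile(r"\b([-'])\b|[^\w\s]|_")

text = "That's a- tasty tic-tac.\nOr -not?"

# Group 1 is empty whenever the second or third alternative matched,
# so r"\1" keeps the enclosed - or ' and drops everything else.
words = rx_text.sub(r"\1", text).split()
print(words)
# => ["That's", 'a', 'tasty', 'tic-tac', 'Or', 'not']
```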

Regex details

  • \b([-'])\b - a - or ' that is enclosed with word chars (letters, digits or underscores), captured into Group 1 (NOTE: if these symbols should only be kept when enclosed with letters, use (?<=[^\W\d_])([-'])(?=[^\W\d_]) instead)
  • | - or
  • [\W_] - any char other than a letter or a digit.
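
To see where the \b version and the letters-only version from the NOTE differ, here is a small comparison. The sample strings and variable names are made up for illustration.

```python
import re

# \b treats digits and underscores as word chars, so "-" in "3-4" is kept.
word_boundary = re.compile(r"\b([-'])\b|[\W_]")
# Lookaround variant: - or ' is kept only when a letter is on both sides.
letters_only = re.compile(r"(?<=[^\W\d_])([-'])(?=[^\W\d_])|[\W_]")

samples = ["tic-tac.", "3-4", "a_b", "rock'n'roll!"]

kept_by_boundary = [word_boundary.sub(r"\1", s) for s in samples]
kept_by_letters = [letters_only.sub(r"\1", s) for s in samples]

print(kept_by_boundary)
# => ['tic-tac', '3-4', 'ab', "rock'n'roll"]
print(kept_by_letters)
# => ['tic-tac', '34', 'ab', "rock'n'roll"]
```

The two only disagree on "3-4": the hyphen sits between digits, which count as word chars for \b but are not letters.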

See the Python demo:

import re
words = ["That's", 'a-', 'tasty', 'tic-tac.','Or', '-not?']
rx = re.compile(r"\b([-'])\b|[\W_]")
print( [rx.sub(r'\1', x) for x in words] )
# => ["That's", 'a', 'tasty', 'tic-tac', 'Or', 'not']
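
Plugged into the nested loop from the question, it could look roughly like this. The file_text content below is a made-up stand-in for your real data, and skipping empty results is my own addition, since a token such as "--" is reduced to an empty string.

```python
import re

rx = re.compile(r"\b([-'])\b|[\W_]")

# Hypothetical stand-in for file_text: a list of lines,
# each line already split into words.
file_text = [
    ["That's", "a-", "tasty", "tic-tac."],
    ["Or", "-not?", "--"],
]

word_list = []
for line in file_text:
    for word in line:
        cleaned = rx.sub(r"\1", word)
        if cleaned:  # assumption: drop tokens that were punctuation only
            word_list.append(cleaned)

print(word_list)
# => ["That's", 'a', 'tasty', 'tic-tac', 'Or', 'not']
```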

Upvotes: 1
