Reputation: 23
As an example, I have already managed to break the sentence
"That's a- tasty tic-tac. Or -not?" into a list of words like this:
words = ["That's", 'a-', 'tasty', 'tic-tac.', 'Or', '-not?']
Now I have to remove every special character I don't need and get this:
words = ["That's", 'a', 'tasty', 'tic-tac', 'Or', 'not']
My current code looks like this:
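import re
pattern = re.compile(r'[\W_]+')
word_list = []
for line in file_text:  # file_text is a list of lines, each line a list of words
    for word in line:
        # strips every non-alphanumeric char, including the ' and - I want to keep
        word_list.append(pattern.sub('', word))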
I have a whole text that I first split into lines, then split each line into words, and finally flatten into a single list of words.
Upvotes: 2
Views: 606
Reputation: 626802
You can use
r"\b([-'])\b|[\W_]"
See the regex demo (the demo is slightly modified so that [\W_] cannot match newlines, since the input at the demo site is a single multiline string).
Regex details
\b([-'])\b - a - or ' that is enclosed with word chars (letters, digits or underscores), captured into Group 1 (NOTE: if you want to keep these symbols only when they are enclosed with letters, use (?<=[^\W\d_])([-'])(?=[^\W\d_]) instead)
| - or
[\W_] - any char other than a letter or a digit.
The \1 replacement puts the captured - or ' back; when [\W_] matched instead, Group 1 is empty and the char is removed.
See the Python demo:
import re
words = ["That's", 'a-', 'tasty', 'tic-tac.','Or', '-not?']
rx = re.compile(r"\b([-'])\b|[\W_]")
print( [rx.sub(r'\1', x) for x in words] )
# => ["That's", 'a', 'tasty', 'tic-tac', 'Or', 'not']
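To illustrate the NOTE above, here is a minimal sketch comparing the two patterns on a few made-up inputs. The `clean` helper, the `samples` list, and the decision to drop words that end up empty are assumptions for illustration, not part of the original question.

```python
import re

# Pattern from the answer: keeps - or ' between any two word chars
rx_word = re.compile(r"\b([-'])\b|[\W_]")
# Letters-only variant from the NOTE: keeps - or ' only between two letters
rx_letter = re.compile(r"(?<=[^\W\d_])([-'])(?=[^\W\d_])|[\W_]")

def clean(rx, words):
    # Hypothetical helper: apply the substitution to every word and
    # drop words that become empty (e.g. a lone "-").
    # An unmatched \1 is replaced with an empty string in Python 3.5+.
    cleaned = (rx.sub(r"\1", w) for w in words)
    return [w for w in cleaned if w]

samples = ["That's", "tic-tac.", "3-4", "a_b", "-", "-not?"]
print(clean(rx_word, samples))
# => ["That's", 'tic-tac', '3-4', 'ab', 'not']
print(clean(rx_letter, samples))
# => ["That's", 'tic-tac', '34', 'ab', 'not']
```

The only difference in this sample is "3-4": the \b version keeps the hyphen because digits are word chars, while the lookaround version removes it because it only accepts letters on both sides.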
Upvotes: 1