Reputation: 275
For the sentence:
"I am very hungry, so mum brings me a cake!"
I want it split by delimiters, and I want all the delimiters except space to be saved as well. So the expected output is:
"I" "am" "very" "hungry" "," "so" "mum" "brings" "me" "a" "cake" "!" "\n"
What I am currently doing is re.split(r'([!:''".,(\s+)\n])', text), which splits the whole sentence but also keeps a lot of space characters that I don't want. I've also tried the regular expression \s|([!:''".,(\s+)\n]), which somehow gives me a lot of None values.
Upvotes: 2
Views: 307
Reputation: 214959
search or findall might be more appropriate here than split:
import re
s = "I am very hungry, so mum brings me a !#$#@ cake!"
print(re.findall(r'[^\w\s]+|\w+', s))
# ['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', '!#$#@', 'cake', '!']
The pattern [^\w\s]+|\w+ means: a sequence of symbols which are neither alphanumeric nor whitespace, OR a sequence of alphanumerics (that is, a word).
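For example, here is the same pattern run against a sentence containing a digit (a quick sketch; the sample string is made up):
import re
s = "mum brings me 2 cakes!"
# \w covers letters, digits and underscore, so '2' becomes a token of its own
print(re.findall(r'[^\w\s]+|\w+', s))
# ['mum', 'brings', 'me', '2', 'cakes', '!']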
Upvotes: 1
Reputation: 61910
One approach is to surround the special characters (,!.\n) with spaces and then split on space:
import re
def tokenize(t, pattern="([,!.\n])"):
    # Pad each delimiter with spaces, split on spaces, and drop the empty strings
    return [e for e in re.sub(pattern, r" \1 ", t).split(' ') if e]
s = "I am very hungry, so mum brings me a cake!\n"
print(tokenize(s))
Output
['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', 'cake', '!', '\n']
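For reference, a small illustration of the intermediate string that re.sub produces before the split (same pattern and sample sentence as above):
import re
s = "I am very hungry, so mum brings me a cake!\n"
# Every delimiter is padded with spaces; split(' ') then isolates it,
# and the comprehension above drops the empty strings left behind
print(repr(re.sub("([,!.\n])", r" \1 ", s)))
# 'I am very hungry ,  so mum brings me a cake !  \n '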
Upvotes: 1
Reputation: 476709
That is because your regular expression contains a capture group. Because of that capture group, re.split will also include the matched delimiters in the result, but this is likely what you want. The only challenge is to filter out the Nones (and other values whose truthiness is False) that appear when the group does not participate in a match; we can do this with:
import re
def tokenize(text):
    # filter(None, ...) removes the Nones and empty strings from the split result
    return filter(None, re.split(r'[ ]+|([!:\'".,\s\n])', text))
For your given sample text, this produces:
>>> list(tokenize("I am very hungry, so mum brings me a cake!\n"))
['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', 'cake', '!', '\n']
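For illustration, a minimal reproduction of where the None values (and empty strings) come from, using a simplified pattern rather than your exact one:
import re
# When the alternative outside the capture group matches (here the \s branch),
# re.split still emits the group's value for that split, which is None
print(re.split(r'\s|(,)', "hungry, so mum"))
# ['hungry', ',', '', None, 'so', None, 'mum']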
Upvotes: 1