Reputation: 13
import regex as re
def tokenize(text):
return re.findall(r'[\w-][-]*\p{L}[\w-]*',text)
text ="let's defeat the SARS-coV-2 delta variant together in 2021!"
tokens= tokenize(text)
print("|".join(tokens))
My output is like that
let|defeat|the|SARS-coV-2|delta|variant|together|in
I would like to get the following out put with no -
|Let|s|defeat|the|SARS|CoV|Delta|variant|together|in
Upvotes: 1
Views: 276
Reputation: 42133
You could use re.sub to replace series of non-letters by the pipe delimiter:
import re
def tokenize(text):
return re.sub(r"[^A-Za-z]+", "|", text)
text ="let's defeat the SARS-coV-2 delta variant together in 2021!"
print(tokenize(text))
let|s|defeat|the|SARS|coV|delta|variant|together|in|
Upvotes: 2
Reputation: 627020
You want to extract any sequence of word chars if this "word" contains at least one letter.
You can achieve this with regex
(where \p{L}
can be used to match any letter) and re
modules (where [^\W\d_]
matches any letter):
# Python PyPi regex:
import regex as re
def tokenize(text):
return re.findall(r'\w*\p{L}\w*',text)
# Python built-in re:
import re
def tokenize(text):
return re.findall(r'\w*[^\W\d_]\w*',text)
import regex as re
text ="let's defeat the SARS-coV-2 delta variant together in 2021!"
def tokenize(text):
return re.findall(r'\w*\p{L}\w*',text)
print("|".join(tokenize(text)))
# => let|s|defeat|the|SARS|coV|delta|variant|together|in
import re
text ="let's defeat the SARS-coV-2 delta variant together in 2021!"
def tokenize(text):
return re.findall(r'\w*[^\W\d_]\w*',text)
print("|".join(tokenize(text)))
# => let|s|defeat|the|SARS|coV|delta|variant|together|in
Upvotes: 0
Reputation: 10709
You can simplify your regex pattern by just using re.split() on the characters that you consider as word-separators such as apostrophe '
, space
, dash -
, etc.
from itertools import filterfalse
import regex as re
def tokenize(text):
splits = re.split("['\s\-]", text)
splits = list(filterfalse(lambda value: re.search("\d", value), splits)) # Remove this line if you wish to include the digits
if splits:
splits[0] = splits[0].capitalize()
return splits
text ="let's defeat the SARS-coV-2 delta variant together in 2021!"
tokens= tokenize(text)
print("|" + "|".join(tokens)) # Remove <"|" +> if you don't intend to put a "|" at the start.
Output:
|Let|s|defeat|the|SARS|coV|delta|variant|together|in
Upvotes: 1