Naceur Fennich

Reputation: 13

Remove "-" using a regular expression

import regex as re

def tokenize(text):
    return re.findall(r'[\w-][-]*\p{L}[\w-]*', text)

text = "let's defeat the SARS-coV-2 delta variant together in 2021!"
tokens = tokenize(text)
print("|".join(tokens))

My output looks like this:

let|defeat|the|SARS-coV-2|delta|variant|together|in

I would like to get the following output, with no -:
|Let|s|defeat|the|SARS|CoV|Delta|variant|together|in

Upvotes: 1

Views: 276

Answers (3)

Alain T.

Reputation: 42133

You could use re.sub to replace each run of non-letters with the pipe delimiter:

import re

def tokenize(text):
    return re.sub(r"[^A-Za-z]+", "|", text)

text = "let's defeat the SARS-coV-2 delta variant together in 2021!"
print(tokenize(text))

Output:

let|s|defeat|the|SARS|coV|delta|variant|together|in|
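Note the trailing | produced by the final "2021!". If that delimiter is unwanted, one possible tweak (my own sketch, not part of the original answer) is to strip pipes from both ends of the result:

```python
import re

def tokenize(text):
    # Replace every run of non-letters with "|", then trim any
    # stray delimiters left at the start or end of the string.
    return re.sub(r"[^A-Za-z]+", "|", text).strip("|")

text = "let's defeat the SARS-coV-2 delta variant together in 2021!"
print(tokenize(text))
# let|s|defeat|the|SARS|coV|delta|variant|together|in
```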

Upvotes: 2

Wiktor Stribiżew

Reputation: 627020

You want to extract any sequence of word chars if this "word" contains at least one letter.

You can achieve this with the regex module (where \p{L} matches any letter) or the built-in re module (where [^\W\d_] matches any letter):

# Python PyPi regex:
import regex as re
def tokenize(text):
    return re.findall(r'\w*\p{L}\w*',text)

# Python built-in re:
import re
def tokenize(text):
    return re.findall(r'\w*[^\W\d_]\w*',text)

Python PyPi regex demo:

import regex as re
text = "let's defeat the SARS-coV-2 delta variant together in 2021!"

def tokenize(text):
    return re.findall(r'\w*\p{L}\w*',text)

print("|".join(tokenize(text)))
# => let|s|defeat|the|SARS|coV|delta|variant|together|in

Python re demo:

import re
text = "let's defeat the SARS-coV-2 delta variant together in 2021!"
def tokenize(text):
    return re.findall(r'\w*[^\W\d_]\w*',text)

print("|".join(tokenize(text)))
# => let|s|defeat|the|SARS|coV|delta|variant|together|in
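Since \w and \W are Unicode-aware in Python 3, the built-in re version also handles accented letters, not just A-Z; a quick check (my own example, not from the answer above):

```python
import re

def tokenize(text):
    # [^\W\d_] means "a word character that is not a digit or underscore",
    # i.e. any Unicode letter, so accented letters are kept too.
    return re.findall(r'\w*[^\W\d_]\w*', text)

print("|".join(tokenize("café déjà-vu 2021")))
# => café|déjà|vu
```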

Upvotes: 0

Niel Godfrey P. Ponciano

Reputation: 10709

You can simplify your regex pattern by just using re.split() on the characters that you consider word separators, such as the apostrophe ', the space, the dash -, etc.

from itertools import filterfalse
import regex as re

def tokenize(text):
    splits = re.split(r"['\s\-]", text)  # Raw strings avoid invalid-escape warnings
    splits = list(filterfalse(lambda value: re.search(r"\d", value), splits))  # Remove this line if you wish to include the digits
    if splits:
        splits[0] = splits[0].capitalize()
    return splits

text = "let's defeat the SARS-coV-2 delta variant together in 2021!"
tokens = tokenize(text)
print("|" + "|".join(tokens))  # Remove <"|" +> if you don't intend to put a "|" at the start.

Output:

|Let|s|defeat|the|SARS|coV|delta|variant|together|in

Upvotes: 1
