Reputation: 383
I'm tokenizing sentences with the following function using regular expressions.
import re

def tokenize2(text):
    if text is None: return ''
    if len(text) == 0: return text
    # 1. Delete urls
    text = re.sub(
        r"((https?|ftps?|file)?:\/\/)?(?:[\w\d!#$&'()*\+,:;=?@[\]\-_.~]|(?:%[0-9a-fA-F][0-9a-fA-F]))+" +
        r"\.([\w\d]{2,6})(\/(?:[\w\d!#$&'()*\+,:;=?@[\]\-_.~]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)*",
        '', text)
    # 2. Delete #hashtags and @usernames
    text = re.sub(r'[@#]\w+', '', text)
    # 3. Replace apostrophe to space
    text = re.sub(r"'", "'", text)
    # 4. Delete ip, non informative digits
    text = re.sub(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', '', text)
    # 5. RT
    text = re.sub(r'RT\s*:', '', text)
    # 6. Tokenize
    tokens = re.findall(r'''(?:\w\.)+\w\.?|\w{3,}-?\w*-*\w*''', text)
    tokens = [w.lower() for w in tokens]
    return tokens
The problem is that it drops words shorter than three characters that come before an apostrophe or a dash. For example, for these French words:
text = "L’apostrophe, l’île, l’histoire, s’il, qu’elle, t’aime, s’appelle, n’oublie pas, jusqu’à, d’une. Apporte-m’en un. Allez–y! Regarde–le!"
tokenize2(text)
['apostrophe',
'île',
'histoire',
'elle',
'aime',
'appelle',
'oublie',
'pas',
'jusqu',
'une',
'apporte-m',
'allez',
'regarde']
If the word pieces are at least three characters long, the function works fine. For example:
text = "to cold-shoulder"
tokenize2(text)
['cold-shoulder']
That is, I need to drop words shorter than three characters, but keep them when they are part of an apostrophe or dash combination.
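Running just the step 6 pattern on one of these words shows what happens (my own check, outside the function): ’ is not a word character, so l’île is split and the lone l never reaches the {3,} minimum.

import re

pattern = r'''(?:\w\.)+\w\.?|\w{3,}-?\w*-*\w*'''
print(re.findall(pattern, "l’île"))          # ['île'] (the one-letter piece is dropped)
print(re.findall(pattern, "cold-shoulder"))  # ['cold-shoulder']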
Upvotes: 2
Views: 159
Reputation: 1185
This regex does it (if I've correctly understood your goal): [-–’\w]{3,}
Full code:
import re

def tokenize2(text):
    if text is None:
        return ''
    if len(text) == 0:
        return text
    # 1. Delete urls
    text = re.sub(
        r"((https?|ftps?|file)?:\/\/)?(?:[\w\d!#$&'()*\+,:;=?@[\]\-_.~]|(?:%[0-9a-fA-F][0-9a-fA-F]))+" +
        r"\.([\w\d]{2,6})(\/(?:[\w\d!#$&'()*\+,:;=?@[\]\-_.~]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)*",
        '', text)
    # 2. Delete #hashtags and @usernames
    text = re.sub(r'[@#]\w+', '', text)
    # 3. Replace apostrophe to space
    text = re.sub(r"'", "'", text)
    # 4. Delete ip, non informative digits
    text = re.sub(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', '', text)
    # 5. RT
    text = re.sub(r'RT\s*:', '', text)
    # 6. Tokenize
    tokens = re.findall(r'''[-–’\w]{3,}''', text)
    tokens = [w.lower() for w in tokens]
    return tokens

if __name__ == "__main__":
    text = "L’apostrophe, l’île, l’histoire, s’il, qu’elle, t’aime, s’appelle, n’oublie pas, jusqu’à, d’une. Apporte-m’en un. Allez–y! Regarde–le!"
    print(tokenize2(text))
    # >> ['l’apostrophe', 'l’île', 'l’histoire', 's’il', 'qu’elle', 't’aime', 's’appelle', 'n’oublie', 'pas', 'jusqu’à', 'd’une', 'apporte-m’en', 'allez–y', 'regarde–le']
As I said in my comment, step 3 in your code is useless because the replacement character is the same as the character being replaced. And had the replacement been a whitespace, as you mentioned in the comments you wanted, you would then have been unable to parse the apostrophe correctly.
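For example, here is a quick sketch of what that whitespace replacement would have done (assuming the curly apostrophe ’ is the character being replaced):

import re

text = "L’apostrophe, jusqu’à"
spaced = re.sub(r"’", " ", text)               # the hypothetical whitespace replacement
print(spaced)                                  # L apostrophe, jusqu à
print(re.findall(r'''[-–’\w]{3,}''', spaced))  # ['apostrophe', 'jusqu'], the pairing is lost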
Additionally, if s'il should not be matched because its pieces are at most two characters long, then (?=[a-zA-Z'-]{3,}['-])([a-zA-Z'-]+) may work. For more explanation, see this answer.
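A quick illustration of that lookahead (note it uses the straight apostrophe '; to handle the curly ’ from your sample text you would have to add it to both character classes):

import re

pattern = r"(?=[a-zA-Z'-]{3,}['-])([a-zA-Z'-]+)"
print(re.findall(pattern, "s'il"))           # [] (only one letter before the apostrophe)
print(re.findall(pattern, "jusqu'à"))        # ["jusqu'"] (the à falls outside [a-zA-Z])
print(re.findall(pattern, "cold-shoulder"))  # ['cold-shoulder']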
EDIT: Added the hyphen to the regex. Note that the only thing missing was the hyphen: - and – are different characters!
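A quick way to check this from Python (standard library only):

import unicodedata

for ch in "-–":
    print(hex(ord(ch)), unicodedata.name(ch))
# 0x2d HYPHEN-MINUS
# 0x2013 EN DASH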
Upvotes: 1