Reputation: 383
I'm tokenizing sentences with the following function using regular expressions.
import re

def tokenize2(text):
    if text is None: return ''
    if len(text) == 0: return text
    # 1. Delete urls
    text = re.sub(
        r"((https?|ftps?|file)?:\/\/)?(?:[\w\d!#$&'()*\+,:;=?@[\]\-_.~]|(?:%[0-9a-fA-F][0-9a-fA-F]))+" +
        r"\.([\w\d]{2,6})(\/(?:[\w\d!#$&'()*\+,:;=?@[\]\-_.~]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)*",
        '', text)
    # 2. Delete #hashtags and @usernames
    text = re.sub(r'[@#]\w+', '', text)
    # 3. Replace apostrophe to space
    text = re.sub(r"'", "'", text)
    # 4. Delete ip, non informative digits
    text = re.sub(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', '', text)
    # 5. RT
    text = re.sub(r'RT\s*:', '', text)
    # 6. Tokenize
    tokens = re.findall(r'''(?:\w\.)+\w\.?|\w{3,}-?\w*-*\w*''', text)
    tokens = [w.lower() for w in tokens]
    return tokens
The problem is that it drops words shorter than three characters that come before an apostrophe or a dash. For example, for these French words:
text = "L’apostrophe, l’île, l’histoire, s’il, qu’elle, t’aime, s’appelle, n’oublie pas, jusqu’à, d’une. Apporte-m’en un. Allez–y! Regarde–le!"
tokenize2(text)
['apostrophe',
'île',
'histoire',
'elle',
'aime',
'appelle',
'oublie',
'pas',
'jusqu',
'une',
'apporte-m',
'allez',
'regarde']
If the word pieces are at least three characters long, the function works fine. For example:
text = "to cold-shoulder"
tokenize2(text)
['cold-shoulder']
That is, I need to drop words shorter than three characters, but keep them when they are part of an apostrophe or dash combination.
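Running just the step 6 pattern on one of these words shows what happens (my own check, outside the function): ’ is not a word character, so l’île is split and the lone l never reaches the {3,} minimum.

import re

pattern = r'''(?:\w\.)+\w\.?|\w{3,}-?\w*-*\w*'''
print(re.findall(pattern, "l’île"))          # ['île'] (the one-letter piece is dropped)
print(re.findall(pattern, "cold-shoulder"))  # ['cold-shoulder']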
Upvotes: 2
Views: 159
Reputation: 1185
This regex does it (if I've correctly understood your goal): [-–’\w]{3,}
Full code:
import re

def tokenize2(text):
    if text is None:
        return ''
    if len(text) == 0:
        return text
    # 1. Delete urls
    text = re.sub(
        r"((https?|ftps?|file)?:\/\/)?(?:[\w\d!#$&'()*\+,:;=?@[\]\-_.~]|(?:%[0-9a-fA-F][0-9a-fA-F]))+" +
        r"\.([\w\d]{2,6})(\/(?:[\w\d!#$&'()*\+,:;=?@[\]\-_.~]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)*",
        '', text)
    # 2. Delete #hashtags and @usernames
    text = re.sub(r'[@#]\w+', '', text)
    # 3. Replace apostrophe to space
    text = re.sub(r"'", "'", text)
    # 4. Delete ip, non informative digits
    text = re.sub(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', '', text)
    # 5. RT
    text = re.sub(r'RT\s*:', '', text)
    # 6. Tokenize
    tokens = re.findall(r'''[-–’\w]{3,}''', text)
    tokens = [w.lower() for w in tokens]
    return tokens

if __name__ == "__main__":
    text = "L’apostrophe, l’île, l’histoire, s’il, qu’elle, t’aime, s’appelle, n’oublie pas, jusqu’à, d’une. Apporte-m’en un. Allez–y! Regarde–le!"
    print(tokenize2(text))
    # >> ['l’apostrophe', 'l’île', 'l’histoire', 's’il', 'qu’elle', 't’aime', 's’appelle', 'n’oublie', 'pas', 'jusqu’à', 'd’une', 'apporte-m’en', 'allez–y', 'regarde–le']
As I said in my comment, step 3 in your code is useless because the replacement character is the same as the character being replaced. And had the replacement been a whitespace, as you mentioned in the comments you wanted, you would then have been unable to parse the apostrophe correctly.
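For example, here is a quick sketch of what that whitespace replacement would have done (assuming the curly apostrophe ’ is the character being replaced):

import re

text = "L’apostrophe, jusqu’à"
spaced = re.sub(r"’", " ", text)               # the hypothetical whitespace replacement
print(spaced)                                  # L apostrophe, jusqu à
print(re.findall(r'''[-–’\w]{3,}''', spaced))  # ['apostrophe', 'jusqu'], the pairing is lost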
Additionally, if s'il should not be matched because its pieces are at most two characters long, then (?=[a-zA-Z'-]{3,}['-])([a-zA-Z'-]+) may work. For more explanation, see this answer.
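A quick illustration of that lookahead (note it uses the straight apostrophe '; to handle the curly ’ from your sample text you would have to add it to both character classes):

import re

pattern = r"(?=[a-zA-Z'-]{3,}['-])([a-zA-Z'-]+)"
print(re.findall(pattern, "s'il"))           # [] (only one letter before the apostrophe)
print(re.findall(pattern, "jusqu'à"))        # ["jusqu'"] (the à falls outside [a-zA-Z])
print(re.findall(pattern, "cold-shoulder"))  # ['cold-shoulder']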
EDIT: Added the hyphen to the regex. Note that the only thing missing was the hyphen: - and – are different characters!
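A quick way to check this from Python (standard library only):

import unicodedata

for ch in "-–":
    print(hex(ord(ch)), unicodedata.name(ch))
# 0x2d HYPHEN-MINUS
# 0x2013 EN DASH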
Upvotes: 1