talha06

Reputation: 6466

Proper way of eliminating letter repetitions from English words?

As the title clearly describes, I wonder what the right way is to eliminate the character repetitions that are commonly used in social media English to exaggerate a feeling. Since I am developing a software solution to correct mistyped words, I need a general algorithm that can be applied to the vast majority of English words. So I am asking experts for the right way to eliminate the extra letters in English words without using learning-based approaches.

ps. (1) I check programmatically whether the word is valid using the WordNet 3.0 database. So far so good, except for some examples such as the word veery, which is defined as "tawny brown North American thrush noted for its song" in WordNet 3.0. I interrupt the letter-elimination process when the word is found in WordNet. So, are there any other knowledge bases that could be used instead of WordNet?

ps. (2) Actually, I asked this question at the English Language & Usage community, but they guided me to ask it here.

Some examples:

haappyy --> happy
amaaazzinng --> amazing
veeerry --> very

As you can see in the examples, the position of the letter repetition varies throughout the word.

Upvotes: 4

Views: 298

Answers (2)

errantlinguist

Reputation: 3828

One major problem in mapping a non-canonical form with reduplicated letters to its canonical form is that it is ambiguous without context: Consider e.g. meeet — is that actually met or meet?

One way of addressing this could be to measure a word's "canonicity" as its probability given its context (i.e. the words preceding it) in an n-gram model: words which are judged to be "aliases" of the word in question (described below) and which are most frequent given the context at hand are treated as the "canonical" form:

Normalization

It is possible to think of forms such as veeerry and very as variants of the same form, namely a modified "ordered bag of characters", so that e.g. both are processed as sequences of ('v','e','r','y'), albeit with different weightings:

  • veeerry: (('v', 1), ('e', 3), ('r', 2), ('y', 1))
  • very: (('v', 1), ('e', 1), ('r', 1), ('y', 1))

In this way, these forms can be processed like normal feature vectors — this will be used to weight their probabilities below.
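For illustration, here is a minimal sketch of that weighted representation using itertools.groupby; the function name weighted_char_bag is made up for this example (a fuller implementation appears in the code listing further below):

from itertools import groupby

def weighted_char_bag(word):
    # Collapse each run of repeated characters into a (character, run_length) pair
    return tuple((char, sum(1 for _ in run)) for char, run in groupby(word))

print(weighted_char_bag("veeerry"))  # (('v', 1), ('e', 3), ('r', 2), ('y', 1))
print(weighted_char_bag("very"))     # (('v', 1), ('e', 1), ('r', 1), ('y', 1))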

Aliasing

We need to be able to map e.g. veeerry to very without, as you described, matching the uncommon word veery. Simply representing veeerry as an unweighted "ordered bag of characters" would mean that veeerry is just as plausible a variant of veery, which it obviously isn't. However, since veery most likely has a very different lexical distribution from that of the adverb very, it would be possible to get the sense of a normalized ('v','e','r','y') based on its probability given its context, for example:

  1. This car is veery big.
  2. #This car is a veery. ("This car is a tawny brown North American thrush noted for its song")

Even without considering the syntactic part-of-speech distinction between these two examples (the first is an adverb and the second is a noun), the raw statistical distributions of these two different words are very different. For that reason, given a good model, P(veery_1 | "This car is") > P(veery_2 | "This car is a"), where veery_1 is the intended very from the first example and veery_2 is the bird from the second.
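To make that intuition concrete, here is a toy sketch of such a context model; the bigram counts below are invented purely for illustration (they are not from any real corpus), but they show how the single preceding word already separates the two readings:

# Illustrative maximum-likelihood bigram estimates; all counts are made up for this example.
bigram_counts = {
    ("is", "very"): 120,
    ("is", "veery"): 1,
    ("a", "veery"): 5,
    ("a", "very"): 2,
}
context_counts = {"is": 200, "a": 300}

def p_bigram(word, previous_word):
    # P(word | previous_word) under a maximum-likelihood bigram model
    return bigram_counts.get((previous_word, word), 0) / context_counts[previous_word]

# "This car is veeerry big." -> the preceding "is" favours the adverb reading
print(p_bigram("very", "is"), p_bigram("veery", "is"))   # 0.6 vs. 0.005
# "This car is a veery."     -> the preceding "a" favours the noun reading
print(p_bigram("veery", "a"), p_bigram("very", "a"))     # ~0.017 vs. ~0.007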

Probabilistic weighting

In order to associate e.g. veeerry with very when normalizing it in a text, while still keeping veery from being normalized to very, we can simply use the feature vector for the ordered bag of characters representing veeerry to calculate its distance from both very and veery, and then use that distance to weight the probability of each in the given context:

best_term(form, context) = argmax_term (P(term | context) * sim(ordered_bag_of_chars(form), ordered_bag_of_chars(term)))

I've written a bit of Python code in order to better explain how this might work in real life:

#!/usr/bin/env python3

from scipy.spatial.distance import cosine

class Vocabulary(object):
    def __init__(self, forms):
        self.forms = forms
        self.form_variations = {}   
        for form in forms:
            unique_symbol_freqs = tuple(count_unique_symbol_freqs(form))
            unique_symbols = tuple(unique_symbol_freq[0] for unique_symbol_freq in unique_symbol_freqs)
            self.form_variations[unique_symbols] = WeightedSymbolSequence(unique_symbol_freqs)

class WeightedSymbolSequence(object):
    def __init__(self, unique_symbols):
        # TODO: Finish implementation
        pass

def count_unique_symbol_freqs(input_seq):
    """Run-length encode a sequence, yielding a (symbol, count) pair for each run of identical symbols."""
    if len(input_seq) > 0:
        # First process the head symbol
        previous_unique_symbol = (input_seq[0], 1)
        # Process tail symbols; Add extra iteration at the end in order to handle trailing single unique symbols
        tail_iter = iter(input_seq[1:])
        try:
            while True:
                input_symbol = next(tail_iter)
                if input_symbol == previous_unique_symbol[0]:
                    previous_unique_symbol = (previous_unique_symbol[0], previous_unique_symbol[1] + 1)
                else:
                    result_unique_symbol = previous_unique_symbol
                    previous_unique_symbol = (input_symbol, 1)
                    yield result_unique_symbol
        except StopIteration:
            # The end of the sequence was encountered; Handle a potential last unique symbol
            yield previous_unique_symbol

if __name__ == '__main__':
    from sys import stderr

    tests = {"haappyy" : "happy", "amaaazzinng" : "amazing", "veeerry" : "very"}

    with open("/usr/share/dict/words", "r") as vocab_instream:
        # TODO: Alias uppercase versions of characters to their lowercase ones so that e.g. "dog" and "Dog" are highly similar but not identical
        vocab = Vocabulary(frozenset(line.strip().lower() for line in vocab_instream))

    for form1, form2 in tests.items():
        # First check if the token is an extant word 
        if form1 in vocab.forms:
            print("\"%s\" found in vocabulary list; No need to look for similar words." % form1)
        else:
            form1_unique_char_freqs = tuple(count_unique_symbol_freqs(form1))
            print(form1_unique_char_freqs)
            form2_unique_char_freqs = tuple(count_unique_symbol_freqs(form2))
            # Note: scipy's cosine() returns a distance; the corresponding similarity is 1 - dist
            dist = cosine([form1_unique_char_freq[1] for form1_unique_char_freq in form1_unique_char_freqs], [form2_unique_char_freq[1] for form2_unique_char_freq in form2_unique_char_freqs])
            print("Cosine distance between \"%s\" and \"%s\": %f" % (form1, form2, dist))
            # TODO: Get probabilities of form variations using WeightedSymbolSequence objects in "vocab.form_variations" and then multiply the probability of each by (1 - dist), i.e. the cosine similarity of the bags of characters for form1 and form2

            try:
                # Get mapping to other variants, e.g. "haappyy" -> "happy"
                form1_unique_chars = tuple(form1_unique_char_freq[0] for form1_unique_char_freq in form1_unique_char_freqs)
                variations = vocab.form_variations[form1_unique_chars]
            except KeyError:
                # TODO: Use e.g. Levenshtein distance in order to find the next-most similar word: No other variations were found
                print("No variations of \"%s\" found." % form1, file=stderr)

Upvotes: 4

Ian Mercer

Reputation: 39277

An alternative to trying to correct words entered by a user is to generate the possibilities ahead of time and store them. If you took a list of common words (no odd birds), created some rules for repeating vowels, and then added all of the generated variants to your dictionary (e.g. an in-memory ternary tree), this could work.

The repeated-'e' forms of 'very' would be stored as 'veery', 'veeery', ..., and if you augmented WordNet with frequency information, 'veery' as a misspelling of 'very' would be marked as being more common than 'veery' the bird.
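A minimal sketch of that precomputation, using simple repeat-the-vowels rules and a plain dict in place of the ternary tree (the names and the tiny word list are illustrative only):

from itertools import product

def repeated_vowel_variants(word, max_repeat=3):
    # For each vowel, allow 1..max_repeat copies; consonants are left untouched
    options = [
        [char * n for n in range(1, max_repeat + 1)] if char in "aeiou" else [char]
        for char in word
    ]
    return {"".join(variant) for variant in product(*options)}

# Precompute a variant -> canonical word lookup ahead of time
lookup = {}
for word in ["very", "happy", "amazing"]:
    for variant in repeated_vowel_variants(word):
        lookup[variant] = word

print(lookup.get("veeery"))  # 'very'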

I do something similar to this in my Natural Language engine where I'm also trying to handle another common social media aberration: the lack of spaces between words.

Upvotes: 2
