Reputation: 25
Could you recommend the best way to do this: I have a list of phrases, for example ["free flower delivery","flower delivery Moscow","color + home delivery","flower delivery + delivery","order flowers + with delivery","color delivery"], and a pattern, "flower delivery". I need to get a list of the phrases that are as close as possible to the pattern.
Could you give some advice on how to do it?
Upvotes: 0
Views: 589
Reputation: 168
The answer given by nflacco is correct. In addition to that, have you tried edit distance? Try fuzzywuzzy (pip install fuzzywuzzy); it uses edit distance to give you a score for how near two sentences are.
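For example, a minimal sketch using fuzzywuzzy's fuzz.ratio with the phrase list from the question:

from fuzzywuzzy import fuzz

pattern = "flower delivery"
phrases = ["free flower delivery", "flower delivery Moscow",
           "color + home delivery", "flower delivery + delivery",
           "order flowers + with delivery", "color delivery"]

# fuzz.ratio returns an edit-distance similarity score from 0 to 100
ranked = sorted(phrases, key=lambda p: fuzz.ratio(pattern, p), reverse=True)
print(ranked)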
Upvotes: 1
Reputation: 5082
There are lots of ways to do this, but the simplest is a direct match: just search the input phrase for the string "flower delivery". That's pretty binary, though, and you can modify this approach to use either bigrams or bag-of-words.
Bag-of-words means we parse the phrase and the pattern into a list or set of the words they contain, i.e. ["flower", "delivery"]. You could score each phrase with some similarity metric (e.g., how many of the words in the pattern occur in the phrase) and then rank the phrases for the closest match:
bag_pattern = set(pattern.split())  # e.g. {"flower", "delivery"}

for phrase in phrases:
    score = 0
    for word in phrase.split():  # split into words; iterating a string directly gives characters
        if word in bag_pattern:
            score += 1
    # do something based on score
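To rank the phrases, you can wrap that loop in a helper and sort by the score. A minimal sketch, reusing phrases and bag_pattern from above (bag_score is an illustrative name, not an established function):

def bag_score(phrase):
    # count how many of the phrase's words appear in the pattern's bag
    return sum(1 for word in phrase.split() if word in bag_pattern)

ranked = sorted(phrases, key=bag_score, reverse=True)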
We might want to take position into account, i.e. "flower delivery" is a more relevant match than "delivery flower". We can calculate the n-grams (typically bigrams or trigrams, so 2- or 3-word groups) for the phrase and the pattern. Let's say we do bigrams:
"flower delivery Moscow" -> ["flower delivery", "delivery Moscow"
You can then apply some sort of scoring to decide how good of a match this is.
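A minimal sketch of that bigram overlap (the bigrams helper is illustrative, not a library function):

def bigrams(text):
    # pair each word with its successor, e.g. "a b c" -> {("a", "b"), ("b", "c")}
    words = text.split()
    return set(zip(words, words[1:]))

pattern_bigrams = bigrams("flower delivery")
score = len(bigrams("flower delivery Moscow") & pattern_bigrams)  # shared bigrams: 1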
In general you want to do some text preprocessing. You may want to eliminate stop words in a bag-of-words approach ("the", "a", etc.), and you may want to normalize verbs and such to their root form.
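For instance, a rough preprocessing pass with NLTK (this assumes NLTK is installed and its stopwords corpus downloaded; Porter stemming stands in here for full normalization to root forms):

from nltk.corpus import stopwords   # requires nltk.download("stopwords") once
from nltk.stem import PorterStemmer

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    # lowercase, drop stop words, and reduce each remaining word to its stem
    return [stemmer.stem(w) for w in text.lower().split() if w not in stop]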
OK, so your boss doesn't like simple stuff that works, and it's been mandated that you do machine learning. This will work too!
The simplest technique is to look at the probabilities of words, and multiply them. The classic example is spam detection for email.
The approach is to take a bunch of emails in text form and group them into two classes: spam and not spam. Then you go over all the emails and, for each unique word you see, count its occurrences in spam vs. not spam. This gives you the probability of a word appearing in a spam email.
Imagine you have an email with the following contents:
"Hello I am a Nigerian prince."
With the probabilities you calculated before, you can look up the probability for each word, multiply them together, and get a score for the email, normalized by the number of words. "Nigerian" and "prince" will have disproportionately high probabilities of appearing in spam email, so this email will score very high!
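A toy sketch of that scoring (the two-email training set and the +1 smoothing are made up for illustration; the geometric mean is one way to read "normalized by the number of words"):

from collections import Counter

spam_docs = ["nigerian prince needs your help", "win money now"]
ham_docs = ["meeting moved to noon", "project status update attached"]

spam_counts = Counter(w for doc in spam_docs for w in doc.split())
ham_counts = Counter(w for doc in ham_docs for w in doc.split())

def spam_probability(word):
    # fraction of this word's occurrences that were in spam, with +1 smoothing
    s = spam_counts[word] + 1
    h = ham_counts[word] + 1
    return s / (s + h)

def spam_score(email):
    words = email.lower().split()
    total = 1.0
    for w in words:
        total *= spam_probability(w)
    return total ** (1 / len(words))  # geometric mean normalizes for length

print(spam_score("hello i am a nigerian prince"))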
The following link covers bag-of-words and n-grams using deep learning techniques:
https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html
Upvotes: 1