Reputation: 734
A friend of mine had an idea to make a speed reading program that displays words one by one (much like existing speed reading programs). However, this program would also filter out words that aren't strictly necessary to the meaning (for when you want to skim something).
I have started to implement this program, but I'm not quite sure what the algorithm for getting rid of "unimportant" words should be.
My idea is to parse the sentence (I'm currently using the Stanford Parser), assign each word a weight based on how important it is to the sentence's meaning, and then start removing the words with the lowest weights. After each removal I will check how "different" the original parse tree and the new tree are, and keep removing the lowest-weighted word until the two trees become too different (the cutoff will be some constant determined via a "calibration" process that each user goes through once). Finally, I will go through each word of the shortened sentence and try to replace it with a simpler or shorter synonym (again while still trying to retain meaning).
There will also be special cases for very common words like "the," "a," and "of."
For example:
"Billy said to Jane, 'Do you want to go out?'"
Would become:
"Billy told Jane 'want go out?'"
This would retain basically all of the meaning of the sentence while shortening it significantly.
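Here is a rough sketch (in Python) of the pruning loop I have in mind; `parse`, `assign_weights`, `tree_similarity`, and the threshold are placeholders for whatever measures end up working:

```python
# Rough sketch of the pruning loop I have in mind. assign_weights and
# tree_similarity are placeholders, not implementations.

def shorten(sentence, parse, assign_weights, tree_similarity, threshold):
    words = sentence.split()
    original_tree = parse(" ".join(words))

    while len(words) > 1:
        weights = assign_weights(words, original_tree)  # one weight per word
        candidate = min(range(len(words)), key=lambda i: weights[i])

        trial = words[:candidate] + words[candidate + 1:]
        trial_tree = parse(" ".join(trial))

        # Stop once removing the cheapest word changes the parse too much.
        if tree_similarity(original_tree, trial_tree) < threshold:
            break
        words = trial

    return " ".join(words)
```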
Is this a good idea for an algorithm, and if so: how should I assign the weights, what tree comparison algorithm should I use, and is the synonym insertion happening at a good point (i.e., should it be done before I try to remove any words)?
Upvotes: 3
Views: 4224
Reputation: 22724
Assuming you use word embeddings as the weighting logic (I can't think of a better way to do it), you can convert phrases into vectors and compare those vectors. Low-weight words such as "a," "an," and "the" are handled nicely this way.
This tutorial might help you: Phrase2Vec In Practice
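As a minimal sketch (assuming spaCy with a model that ships word vectors, such as en_core_web_md), you can average word vectors into a phrase vector and compare phrases with cosine similarity:

```python
# Minimal sketch using spaCy's built-in vectors; assumes the en_core_web_md
# model (which includes word vectors) is installed.
import spacy

nlp = spacy.load("en_core_web_md")

def phrase_similarity(a: str, b: str) -> float:
    # Doc.vector is the average of the token vectors, and Doc.similarity
    # returns the cosine similarity between those averaged vectors.
    return nlp(a).similarity(nlp(b))

print(phrase_similarity("Billy said to Jane, 'Do you want to go out?'",
                        "Billy told Jane 'want go out?'"))
```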
Upvotes: 1
Reputation: 2296
You can use the method described in this paper for computing the similarity of two sentences: Corpus-based and Knowledge-based Measures of Text Semantic Similarity
You can remove words until the similarity with the original sentence drops significantly (this is an interesting problem in itself).
You can also check a simplified version of the similarity algorithm here: Wordnet Sentence Similarity
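As a rough sketch of that idea (using NLTK's WordNet interface; the greedy removal loop and the 0.6 cutoff are just illustrative assumptions, not values from the paper):

```python
# Sketch: simplified WordNet-based sentence similarity plus a greedy loop
# that removes words until similarity to the original sentence drops.
from nltk.corpus import wordnet as wn

def word_similarity(w1, w2):
    # Best path similarity over all synset pairs of the two words (0 if none).
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def sentence_similarity(s1, s2):
    # For each word in one sentence, take its best match in the other,
    # average the scores, and symmetrize the two directions.
    w1, w2 = s1.lower().split(), s2.lower().split()
    if not w1 or not w2:
        return 0.0
    def directed(a, b):
        return sum(max(word_similarity(x, y) for y in b) for x in a) / len(a)
    return (directed(w1, w2) + directed(w2, w1)) / 2

def shorten(sentence, min_similarity=0.6):  # threshold is illustrative
    original, words = sentence, sentence.split()
    while len(words) > 1:
        # Try dropping each remaining word; keep the removal that hurts least.
        best = max((words[:i] + words[i + 1:] for i in range(len(words))),
                   key=lambda w: sentence_similarity(original, " ".join(w)))
        if sentence_similarity(original, " ".join(best)) < min_similarity:
            break
        words = best
    return " ".join(words)
```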
Upvotes: 2
Reputation: 556
Assigning weights is the million-dollar question here. As a first step, I would identify the parts of the sentence (subject, predicate, clauses, etc.) and the sentence structure (simple, compound, complex, etc.) to find "anchor" words that should get the highest weights. That should make the rest of the task easier.
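A dependency parse makes those anchors fairly easy to pick out. A minimal sketch with spaCy (which dependency labels count as anchors is my own illustrative choice):

```python
# Sketch: flag "anchor" words (root verb, subjects, objects) from a
# dependency parse; these would get the highest weights.
import spacy

nlp = spacy.load("en_core_web_sm")
ANCHOR_DEPS = {"ROOT", "nsubj", "nsubjpass", "dobj", "pobj", "attr"}

def anchor_words(sentence):
    doc = nlp(sentence)
    return [tok.text for tok in doc if tok.dep_ in ANCHOR_DEPS]

print(anchor_words("Billy said to Jane, 'Do you want to go out?'"))
# e.g. ['Billy', 'said', 'you', 'want'] -- exact output depends on the parser
```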
Upvotes: 1