Kristian

Reputation: 21800

Algorithm to compare similarity of ideas (as strings)

Consider an arbitrary text box that records the answer to the question, "What do you want to do before you die?"

Using a collection of response strings (max length 240), I'd like to somehow sort and group them and count them by idea (which may be just string similarity as described in this question).

  1. Is there another or better way to do something like this?
  2. Is this any different than string similarity?
  3. Is this the right question to be asking?

The idea here is to have people write in the text box over and over again, and for me to provide a number that says, generally speaking, that 802 people wrote approximately the same thing.

Upvotes: 5

Views: 1245

Answers (3)

mbatchkarov

Reputation: 16039

What you want is very much an open problem in NLP. @Ali's answer describes the idea at a high level, but the part "Construct a document vector for every answer" is the really hard one. There are a few obvious ways of building a document vector from the vectors of the words it contains. Addition, multiplication and averaging are fast, but they effectively ignore the syntax: "Man bites dog" and "Dog bites man" will have the same representation, but clearly not the same meaning. Google "compositional distributional semantics"; as far as I know, there are people at the Universities of Texas, Trento, Oxford and Sussex, and at Google, working in the area.
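A minimal sketch of the averaging approach, using toy word vectors invented for illustration (in practice they would come from something like word2vec or GloVe), to show how word order gets lost:

    import numpy as np

    # Hypothetical toy word vectors, made up for illustration only.
    word_vectors = {
        "man":   np.array([0.9, 0.1, 0.0]),
        "bites": np.array([0.2, 0.8, 0.3]),
        "dog":   np.array([0.1, 0.2, 0.9]),
    }

    def average_vector(sentence):
        """Build a document vector by averaging the vectors of its words."""
        words = sentence.lower().split()
        return np.mean([word_vectors[w] for w in words], axis=0)

    v1 = average_vector("Man bites dog")
    v2 = average_vector("Dog bites man")
    print(np.allclose(v1, v2))  # True: the two sentences are indistinguishable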

Upvotes: 1

Ali Ferhat

Reputation: 2579

It is much more difficult than string similarity. This is what you need to do at a minimum (a rough sketch follows the list):

  • Perform some text formatting/cleaning tasks like removing punctuation characters and common "stop words"
  • Construct a corpus (a collection of words with their usage statistics) from the terms that occur in the answers.
  • Calculate a weight for every term.
  • Construct a document vector for every answer (each term corresponds to a dimension in a very high-dimensional Euclidean space)
  • Run a clustering algorithm on document vectors.
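A rough end-to-end sketch of these steps, assuming scikit-learn (one possible library among many; the example answers and the number of clusters are invented for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Made-up example answers.
    answers = [
        "I want to travel the world",
        "See the whole world",
        "Run a marathon",
        "I'd like to run a marathon someday",
    ]

    # Cleaning, stop-word removal, corpus statistics, term weighting (TF-IDF)
    # and document vectors are all handled by the vectorizer.
    vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
    doc_vectors = vectorizer.fit_transform(answers)

    # Cluster the document vectors (k is chosen arbitrarily here).
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(doc_vectors)

    for label, answer in zip(labels, answers):
        print(label, answer)

Counting how many answers land in each cluster then gives the "802 people wrote approximately the same thing" number the question asks for.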

Read a good statistical natural language processing book, or search Google for good introductions/tutorials (likely search terms: statistical NLP, text categorization, clustering). You can probably find some libraries (Weka or NLTK come to mind) depending on the language of your choice, but you need to understand the concepts to use a library anyway.

Upvotes: 8

Franck Dernoncourt

Reputation: 83157

Latent Semantic Analysis (LSA) might interest you. Here is a nice introduction:

Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. [...]
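A minimal LSA sketch, assuming scikit-learn (truncated SVD over a TF-IDF matrix is one common way to implement LSA; the toy answers are invented):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    answers = [
        "travel around the world",
        "see every country on earth",
        "write a novel",
    ]

    # TF-IDF term space, then a low-rank "concept" space via truncated SVD.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(answers)
    lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

    # Compare answers in the latent space rather than by raw term overlap.
    print(cosine_similarity(lsa))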

Upvotes: 2
