Reputation: 21800
Consider an arbitrary text box that records answers to the question, "What do you want to do before you die?"
Using a collection of response strings (max length 240), I'd like to somehow sort and group them and count them by idea (which may be just string similarity as described in this question).
The idea is to have many people type an answer into the text box, and for me to then report a count per idea: for example, that 802 people wrote approximately the same thing.
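For the plain string-similarity version of this, a minimal sketch might look like the following. It uses only the standard library's `difflib.SequenceMatcher`; the `normalize` and `group_by_similarity` helpers, the greedy first-match strategy, and the 0.8 threshold are all assumptions made for illustration, not a recommended design. It groups near-identical wording only and knows nothing about meaning.

```python
from difflib import SequenceMatcher


def normalize(text):
    # Lowercase and collapse whitespace so trivial differences do not matter.
    return " ".join(text.lower().split())


def group_by_similarity(responses, threshold=0.8):
    """Greedily put each response into the first group whose representative
    is at least `threshold` similar; otherwise start a new group."""
    groups = []  # list of (representative, [members])
    for response in responses:
        key = normalize(response)
        for representative, members in groups:
            # autojunk=False: the "popular character" heuristic would
            # otherwise kick in for strings of 200+ characters.
            matcher = SequenceMatcher(None, representative, key, autojunk=False)
            if matcher.ratio() >= threshold:
                members.append(response)
                break
        else:
            # No existing group was close enough.
            groups.append((key, [response]))
    # Largest groups first.
    return sorted(groups, key=lambda group: len(group[1]), reverse=True)


responses = [
    "Travel the world",
    "travel the world!",
    "Travel  the world.",
    "Learn to play the piano",
    "learn to play piano",
    "Write a novel",
]

groups = group_by_similarity(responses)
for representative, members in groups:
    print(len(members), representative)
```

Each response is compared against every existing group, so this is quadratic in the worst case, and "See the world" would not land in the "Travel the world" group unless the threshold is loosened considerably.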
Upvotes: 5
Views: 1245
Reputation: 16039
What you want is very much an open problem in NLP. @Ali's answer describes the idea at a high level, but the part "Construct a document vector for every answer" is the really hard one. There are a few obvious ways of building a document vector from the vectors of the words it contains. Addition, multiplication and averaging are fast, but they effectively ignore syntax: "Man bites dog" and "Dog bites man" will have the same representation, but clearly not the same meaning. Google "compositional distributional semantics"; as far as I know, there are people at the Universities of Texas, Trento, Oxford and Sussex, and at Google, working in this area.
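To make the word-order point concrete, here is a small sketch. The three-dimensional word vectors are invented for illustration (real ones would come from a trained model), and `compose` is a hypothetical helper name. The values are chosen to be exactly representable in binary floating point so that comparing the results with `==` is safe.

```python
from math import prod

# Toy word vectors, made up for this example only.
WORD_VECTORS = {
    "man": [1.0, 0.25, 0.5],
    "bites": [0.5, 1.0, 0.25],
    "dog": [0.25, 0.5, 2.0],
}


def compose(sentence, how):
    """Build a sentence vector from its word vectors, dimension by dimension."""
    vectors = [WORD_VECTORS[word] for word in sentence.lower().split()]
    columns = list(zip(*vectors))  # one tuple per dimension
    if how == "add":
        return [sum(column) for column in columns]
    if how == "multiply":
        return [prod(column) for column in columns]
    if how == "average":
        return [sum(column) / len(column) for column in columns]
    raise ValueError(f"unknown composition: {how}")


for how in ("add", "multiply", "average"):
    same = compose("Man bites dog", how) == compose("Dog bites man", how)
    print(how, same)
```

All three operations are commutative, so every line prints `True`: the two sentences are indistinguishable once composed this way.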
Upvotes: 1
Reputation: 2579
It is much more difficult than string similarity. This is what you need to do at a minimum:
Read a good statistical natural language processing book, or search Google for good introductions and tutorials (likely terms: statistical NLP, text categorization, clustering). You can probably find some libraries (Weka or NLTK come to mind) depending on the language of your choice, but you need to understand the concepts to use a library anyway.
Upvotes: 8
Reputation: 83157
Latent Semantic Analysis (LSA) might interest you. Here is a nice introduction:
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. [...]
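As a rough sketch of what that looks like in practice, the snippet below builds a term-document count matrix for five made-up responses and reduces it to two "concepts" with a truncated SVD. It assumes NumPy is available; the sample responses, the choice of `k = 2`, the raw counts (no TF-IDF weighting or stop-word removal), and the `cosine` helper are all illustrative assumptions.

```python
import numpy as np

docs = [
    "travel abroad",
    "travel abroad see world",
    "see world",
    "learn piano",
    "learn piano music",
]

vocabulary = sorted({word for doc in docs for word in doc.split()})
row = {word: i for i, word in enumerate(vocabulary)}

# Term-document count matrix: one row per term, one column per document.
X = np.zeros((len(vocabulary), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        X[row[word], j] += 1

# Truncated SVD: keep only the k strongest singular values ("concepts").
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # shape: (n_docs, k)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# "travel abroad" and "see world" share no words at all.
raw = cosine(X[:, 0], X[:, 2])
lsa = cosine(doc_vectors[0], doc_vectors[2])
print("raw cosine:", raw)
print("LSA cosine:", lsa)
```

On this toy data the raw cosine between "travel abroad" and "see world" is 0, while in the two-concept space it comes out close to 1, because both co-occur with the same words in the second response. Real responses are far noisier, so the number of concepts and the weighting scheme would need tuning.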
Upvotes: 2