Reputation: 703
I am working on a knowledge-based question answering system. It takes a query from the user, downloads the relevant documents from Wikipedia, strips all the HTML tags, and extracts the plain text. It then tokenizes each document into sentences and builds the term-document (TD) matrix, treating each sentence as a document. The query is also added as a sentence, and stemming is applied when the matrix is built. The TD matrix is then passed to the pLSA (Probabilistic Latent Semantic Analysis) algorithm. Finally, the system computes the cosine similarity between each document (sentence) vector and the query vector, and displays the most similar sentence as the answer.
The problem is that it does display a result, but not the most relevant one. Where am I going wrong? Is the strategy I am following correct, or is there another algorithm that may help? Below are some of the questions and the answers returned by my system:
What is photosynthesis?
ANSWER 1 : The stroma contains stacks (grana) of thylakoids, which are the site of photosynthesis
ANSWER 2 : Factors leaf is the primary site of photosynthesis in plants
ANSWER 3 : Samuel Ruben and Martin Kamen used radioactive isotopes to determine that the oxygen liberated in photosynthesis came from the water
ANSWER 4 : In plants, algae and cyanobacteria, photosynthesis releases oxygen
Another question
What is Artificial Intelligence?
ANSWER 1 : the problem of creating 'artificial intelligence' will substantially be solved"
ANSWER 2 : 37 The leading-edge definition of artificial intelligence research is changing over time
ANSWER 3 : Stories of these creatures and their fates discuss many of the same hopes, fears and ethical concerns that are presented by artificial intelligence
ANSWER 4 : History of artificial intelligence and Timeline of artificial intelligence Thinking machines and artificial beings appear in Greek myths , such as Talos of Crete , the bronze robot of Hephaestus , and Pygmalion's Galatea 13 Human likenesses believed to have intelligence were built in every major civilization
Another question
Who is a hacker?
ANSWER 1 : 19 Hackers (short stories) Helba from the
ANSWER 2 : 16 Rafael Núñez aka RaFa was a notorious most wanted hacker by the FBI since 2001
ANSWER 3 : Often, this type of 'white hat' hacker is called an ethical hacker
ANSWER 4 : Hackers also commonly use port scanners
yet another run
What is biology?
ANSWER 1 : Molecular biology is the study of biology at a molecular level
ANSWER 2 : molecular biology studies the complex interactions of systems of biological molecules
ANSWER 3 : The similarities and differences between cell types are particularly relevant to molecular biology
ANSWER 4 : Contents History Foundations of modern biology 2
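For reference, the ranking step described in the question (a term-sentence matrix with the query appended as one more sentence, then cosine similarity against the query vector) can be sketched roughly as below. This is a minimal sketch, not the asker's code: it uses raw term counts, skips the pLSA and stemming stages entirely, and the function names (`tokenize`, `build_term_sentence_matrix`, `rank_sentences`) and the sample sentences are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens (no stemming in this sketch)."""
    return re.findall(r"[a-z]+", text.lower())


def build_term_sentence_matrix(sentences):
    """Return (vocabulary, rows) with one raw term-count vector per sentence."""
    counts = [Counter(tokenize(s)) for s in sentences]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_sentences(query, sentences):
    """Rank sentences by cosine similarity to the query, best first."""
    # The query is appended as one more "sentence", as described in the question.
    _, rows = build_term_sentence_matrix(sentences + [query])
    query_vec = rows[-1]
    scored = [(cosine(row, query_vec), s) for row, s in zip(rows, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Illustrative sentences only; the first one is invented for this example.
sample = [
    "Photosynthesis is a process used by plants to convert light energy into chemical energy",
    "Hackers also commonly use port scanners",
    "In plants, algae and cyanobacteria, photosynthesis releases oxygen",
]
for similarity, sentence in rank_sentences("What is photosynthesis?", sample):
    print(round(similarity, 3), sentence)
```

Because cosine similarity divides by vector length, a short sentence that merely mentions the query term can score as well as or better than a longer sentence that actually defines it, which may be part of what the outputs above show.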
Upvotes: 3
Views: 1226
Reputation: 8225
This is a well-studied problem called Question Answering (QA). I have provided a summary of QA in another answer. In particular, all of your examples would fall under the category of "definition questions", according to TREC. I suggest perusing some of the papers returned by a query for "TREC definition questions" on Google or Google Scholar for ideas.
Upvotes: 2
Reputation: 15422
I think it will be difficult to improve your system if you keep a purely statistical approach. From a statistical NLP standpoint, you are doing the right things. You can still fine-tune some parameters. To do that, build a training corpus in which you tell the system which answer is the right one, and then see which value each parameter has to take to give you that answer.
That being said, I don't think that fine-tuning parameters will improve your accuracy by more than 20% to 30%.
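The tuning loop described above can be sketched as a small grid search over labeled examples. Everything here is an assumption made for illustration: the scorer, its single length-damping parameter `alpha`, and the two toy training examples are hypothetical and are not the parameters of pLSA or of the asker's system.

```python
import re


def tokenize(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score(query, sentence, alpha):
    """Hypothetical scorer: count of query-term hits, damped by sentence length.

    alpha = 0 ignores length; larger alpha favors shorter sentences.
    """
    query_terms = set(tokenize(query))
    sentence_terms = tokenize(sentence)
    if not sentence_terms:
        return 0.0
    overlap = sum(1 for term in sentence_terms if term in query_terms)
    return overlap / (len(sentence_terms) ** alpha)


def best_sentence(query, sentences, alpha):
    """Return the top-scoring candidate sentence for the query."""
    return max(sentences, key=lambda s: score(query, s, alpha))


def accuracy(labeled, alpha):
    """Fraction of labeled examples whose top-ranked sentence is the gold one."""
    hits = sum(
        1
        for query, sentences, gold in labeled
        if best_sentence(query, sentences, alpha) == gold
    )
    return hits / len(labeled)


def tune(labeled, grid):
    """Return (best_alpha, best_accuracy) over the candidate values in grid."""
    return max(((a, accuracy(labeled, a)) for a in grid), key=lambda p: p[1])


# Toy training corpus: (question, candidate sentences, gold answer).
# The sentences are invented for this example.
labeled = [
    (
        "Who is a hacker?",
        [
            "Some say a hacker is a person who knows what a computer is doing",
            "A hacker is a skilled programmer",
        ],
        "A hacker is a skilled programmer",
    ),
    (
        "What is photosynthesis?",
        [
            "Hackers also commonly use port scanners",
            "Photosynthesis is how plants make food from light",
        ],
        "Photosynthesis is how plants make food from light",
    ),
]

best_alpha, best_accuracy = tune(labeled, [0.0, 0.5, 1.0])
print(best_alpha, best_accuracy)
```

In practice the labeled set would need to be much larger, and the chosen value should be checked on questions that were not used for tuning.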
If you want to go further, you'll need a more semantic approach, and represent knowledge symbolically. Check for instance http://www.jfsowa.com/
Upvotes: 1