gramme.ninja

Reputation: 1351

Algorithm to get the topic / focus of a sentence from the words in the sentence

Are there any well-known or successful algorithms for obtaining the topic and/or focus of a sentence (a question) from the words in that sentence?

If not, how would I go about getting the topic / focus of the question? It seems that the topic / focus of a question is usually a noun or a noun phrase.

So the first thing I would do is find the nouns by part-of-speech tagging the question. But then how do I know whether I should take just the nouns, or the noun(s) and an adjective before them, or the noun and the adverb before it, or the noun(s) and the verb?

For example:

In ' did the quick brown fox jump over the lazy dog ', get ' quick brown fox ', ' jump ', and ' lazy dog '.

In ' what is the population of japan ', get ' population ' and ' japan '

In ' what color is milk ' get ' color ' and ' milk '

In ' What is the height of Mt. Everest ' get ' Mt. Everest ' and ' height '.

While writing these examples, I realized that the easiest way might simply be removing stop words.
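To make the stop-word idea concrete, here is a minimal sketch. The `STOP_WORDS` set and the `content_words` helper are my own illustrative assumptions (a real stop-word list would be much longer), and the tokenization is just whitespace splitting.

```python
# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {
    "a", "an", "the", "is", "are", "was", "did", "do", "does",
    "what", "which", "who", "of", "in", "over", "to",
}

def content_words(question):
    """Return the words of `question` that are not stop words, in order."""
    tokens = question.lower().replace("?", " ").split()
    return [t for t in tokens if t not in STOP_WORDS]

print(content_words("what is the population of japan"))
print(content_words("what color is milk"))
```

Note that this keeps the words but loses the grouping: ' quick brown fox ' comes back as three separate words, so it does not by itself say which words belong to the same phrase.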

Upvotes: 3

Views: 2167

Answers (2)

CTsiddharth

Reputation: 907

This can be treated as a parsing problem, and I personally find the Stanford NLP tools very effective.

Here is the link to the demo of the Stanford Parser.

For the example ' did the quick brown fox jump over the lazy dog ', the output you get is:

did/VBD
the/DT
quick/JJ
brown/JJ
fox/NN
jump/VB
over/RP
the/DT
lazy/JJ
dog/NN

From this output you can write an extractor that pulls out the nouns (plus adjectives and adverbs if need be) and thus obtains the topics of the sentence.

Moreover, the parse tree looks like:
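A rough sketch of such an extractor, assuming the tagger output is available as `word/TAG` strings like the ones above. The function name `noun_chunks` and the grouping rule (a run of adjectives followed by one or more nouns) are my own assumptions, not something the tagger provides.

```python
def noun_chunks(tagged):
    """Group runs of adjectives (JJ*) followed by nouns (NN*) into phrases."""
    chunks, current, has_noun = [], [], False
    for token in tagged:
        word, _, tag = token.rpartition("/")
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag.startswith("JJ") and not has_noun:
            current.append(word)
        else:
            # Any other tag ends the current run; keep it only if it had a noun.
            if has_noun:
                chunks.append(" ".join(current))
            current, has_noun = [], False
            if tag.startswith("JJ"):
                current.append(word)  # an adjective may start a new run
    if has_noun:
        chunks.append(" ".join(current))
    return chunks

tagged = "did/VBD the/DT quick/JJ brown/JJ fox/NN jump/VB over/RP the/DT lazy/JJ dog/NN".split()
print(noun_chunks(tagged))
```

If you also want verbs such as ' jump ', you could collect tokens whose tag starts with `VB` in the same loop, though auxiliaries like ' did ' would then need filtering.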

(ROOT
  (SINV (VBD did)
    (NP (DT the) (JJ quick) (JJ brown) (NN fox))
    (VP (VB jump)
      (PRT (RP over))
      (NP (DT the) (JJ lazy) (NN dog)))))

If you take a closer look at the parse tree, the outputs you are expecting are both NPs (noun phrases): ' the quick brown fox ' and ' the lazy dog '.

I hope this helps!
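As a sketch of how the NPs could be pulled out of a bracketed tree like the one above, here is a small pure-Python reader. The helper names (`parse_tree`, `leaves`, `noun_phrases`) are hypothetical, and this assumes the tree is given as a well-formed bracketed string; it is not the parser's own API.

```python
import re

def parse_tree(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()

def leaves(tree):
    """Return the words under `tree`, left to right."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

def noun_phrases(tree):
    """Return the text of every NP subtree (nested NPs are all reported)."""
    label, children = tree
    found = [" ".join(leaves(tree))] if label == "NP" else []
    for child in children:
        if isinstance(child, tuple):
            found.extend(noun_phrases(child))
    return found

TREE = """(ROOT (SINV (VBD did)
  (NP (DT the) (JJ quick) (JJ brown) (NN fox))
  (VP (VB jump) (PRT (RP over))
    (NP (DT the) (JJ lazy) (NN dog)))))"""
print(noun_phrases(parse_tree(TREE)))
```

The determiners are still included here; dropping the `DT` leaves would give ' quick brown fox ' and ' lazy dog ' as in the question.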

Upvotes: 3

Sara S

Reputation: 679

First of all, I think the problem is language-dependent.

Secondly, if you have a set of words, you could check their popularity/frequency in the language; e.g. the word "the" occurs much more often than the word "euphoric", so "euphoric" has a better chance of being a proper keyword.
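A minimal sketch of the frequency check, assuming you have some background text to count words in. The toy `corpus` below and the helper names `build_counts` and `rarest_first` are made up for illustration; in practice you would count over a large corpus of the language.

```python
from collections import Counter

def build_counts(corpus):
    """Count how often each lowercased word occurs in a list of sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(sentence.lower().split())
    return counts

def rarest_first(words, counts):
    """Order words from least to most frequent; unseen words count as 0."""
    return sorted(words, key=lambda w: counts[w.lower()])

# Toy background text, purely illustrative.
corpus = ["the cat sat on the mat", "the dog ate the bone", "a euphoric dog"]
counts = build_counts(corpus)
print(rarest_first(["the", "euphoric", "dog"], counts))
```

Words near the front of the result are the rarer ones and therefore the better keyword candidates under this heuristic.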

Here, however, correct spelling is crucial. How to deal with this? One idea is to apply distance algorithms such as Levenshtein to words that do not occur often (or do a Google search with the word and check whether you get results or a "did you mean" notification).

Some languages are, though, more structured than others. In English, to find nouns, you can first check for the pattern "a/an word" and then for words that end in "s" to find possible candidates for nouns. Then compare the candidates with a dictionary.
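For reference, a short sketch of the Levenshtein distance and how it could be used to map a rare (possibly misspelled) word to the closest known word. The `closest` helper and the small vocabulary are assumptions for illustration.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]

def closest(word, vocabulary):
    """Return the vocabulary entry with the smallest distance to `word`."""
    return min(vocabulary, key=lambda v: levenshtein(word, v))

print(levenshtein("everst", "everest"))
print(closest("everst", ["everest", "forest", "event"]))
```

In practice you would probably only accept the closest match when its distance is below some small threshold; otherwise the word may simply be a rare but correct term.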

With adjectives you can perhaps assume that a possible adjective will be located right before the noun. Then just compare the possible adjective with the dictionary.

Then you could of course keep a black-list of words that are never allowed as keywords.
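The noun heuristics and the blacklist could be sketched roughly like this. The function `noun_candidates`, the tiny noun dictionary, and the naive "strip the final s" singular form are all assumptions for illustration, and the rules are deliberately crude.

```python
def noun_candidates(words, nouns, blacklist=frozenset()):
    """Pick words that follow "a"/"an" or end in "s", confirmed by a dictionary."""
    lowered = [w.lower() for w in words]
    found = []
    for i, word in enumerate(lowered):
        after_article = i > 0 and lowered[i - 1] in ("a", "an")
        plural_like = word.endswith("s")
        if not (after_article or plural_like) or word in blacklist:
            continue
        # Dictionary check; for plural-looking words also try dropping the "s".
        if word in nouns or (plural_like and word[:-1] in nouns):
            found.append(word)
    return found

nouns = {"dog", "cat"}  # stand-in for a real dictionary
print(noun_candidates("is a dog chasing cats".split(), nouns, blacklist={"is"}))
```

A phrase like "a quick fox" shows the limit of the article rule: the word after "a" is an adjective, so the dictionary comparison is what actually filters the candidates.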

The best solution would perhaps be a self-learning neural system, but I'm not familiar enough with those to give any suggestions.

Upvotes: 4
