Ananda
Ananda

Reputation: 1572

How to extract meaningful keywords from a query?

I am working on a project on web intelligence in which I have to build a system which accepts user query and extract meaningful keywords. Say for example user enters a query "How to do socket programming in Java", then I have to ignore "how", "to", "do", "in" and take "socket", "programming", "java" for further processing and clustering e.g. socket and programming are two different meaningful keyword but can be used together as keyword which produce different meaning. I am looking for some algorithm like TF-IDF to approach this problem. Any help will be appreciated.

Upvotes: 1

Views: 1995

Answers (1)

jpsfer
jpsfer

Reputation: 594

Well what you are looking into a text analytics solution.

I have only used R for this purpose but one way to look at it is you need a list of words that you consider not meaningful keywords, this is often called "stop words". You can find online lists of stop words for almost any popular language. After doing this you might want to get a couple hundred inputs and calculate the frequency of every keyword there (having already removed stop words, as well as punctuation and having all text in lower-case) and try to identify other keywords that you think are irrelevant and add them to your list of words to remove.

After this there are a ton of options you can explore; an example would be stemming which is getting the core term of each word so that "pages" and "page" are considered the same keyword. (as you go deeper you will find a ton of stuff online to fine-tune your approach)

Hope this helps.

Upvotes: 3

Related Questions