Reputation: 3339
From Google Analytics I have a (long) list of keywords that people used in search engines to find my website. I want to find the 'core keywords'. A hypothetical example:
java online training
learning java
scala training
training for java
online training java
learn scala programming
The ideal result would be: 'java', 'online training', 'training', 'scala' and 'learn'.
The difficulty seems to be detecting complete phrases, ignoring common words (for) and handling variations (learn-learning).
Is there a library that can do that (preferably for JVM)? Or is there a suitable algorithm I can implement myself?
Upvotes: 5
Views: 1831
Reputation: 9875
This is a term (or keyword) extraction problem. A quick search turned up KEA, which looks to be very much what you want.
You can implement a naive solution by the following algorithm:
As you said, this will have a problem with stopwords. You can do something simple, like keeping a dictionary of stopwords, or you can use something like Term Frequency-Inverse Document Frequency (TF-IDF), which can help you automatically recognize very frequent terms. KEA will do this for you, so it might be best to look into that first.
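To make the stopword-dictionary idea concrete, here is a minimal sketch of one naive approach: count the words and two-word phrases that recur across queries. It is not KEA and not necessarily the algorithm described above. The stopword list, the `ing`-stripping rule, and the names `core_keywords`, `count_terms`, and `min_count` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"for", "the", "a", "an", "of", "in", "to", "and"}


def normalize(word, vocab):
    """Map 'learning' to 'learn', but only if 'learn' itself occurs in the data.

    This is a crude heuristic (an assumption), not a real stemmer.
    """
    if word.endswith("ing") and word[:-3] in vocab:
        return word[:-3]
    return word


def tokenize(query):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in query.lower().split() if w not in STOPWORDS]


def count_terms(queries, max_n=2):
    """Count each n-gram (n = 1..max_n) at most once per query."""
    tokenized = [tokenize(q) for q in queries]
    vocab = {w for tokens in tokenized for w in tokens}
    counts = Counter()
    for tokens in tokenized:
        tokens = [normalize(w, vocab) for w in tokens]
        seen = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(" ".join(tokens[i:i + n]))
        counts.update(seen)
    return counts


def core_keywords(queries, min_count=2, max_n=2):
    """Return terms found in at least min_count queries, most frequent first."""
    counts = count_terms(queries, max_n)
    frequent = [term for term, c in counts.items() if c >= min_count]
    return sorted(frequent, key=lambda term: (-counts[term], term))
```

On the six example queries from the question, `core_keywords` returns the five desired terms plus `online` and `training java`. Phrases are built after stopwords are dropped, so "training for java" contributes the phrase `training java`. A further filter, for example dropping a word that only ever appears inside one longer phrase, would be needed to get closer to the ideal result.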
Hope that helps!
Upvotes: 3