stephanos

Reputation: 3339

How to cluster search engine keywords?

From Google Analytics I have a (long) list of keywords that people used in search engines to find my website. I want to find the 'core keywords'. A hypothetical example:

java online training
learning java
scala training
training for java
online training java
learn scala programming

The ideal result would be: 'java', 'online training', 'training', 'scala' and 'learn'.

The difficulty seems to be detecting complete phrases, ignoring common words (for) and handling variations (learn-learning).

Is there a library that can do that (preferably for JVM)? Or is there a suitable algorithm I can implement myself?

Upvotes: 5

Views: 1831

Answers (1)

sjr

Reputation: 9875

This is a term or keyword extraction problem. I did a search and it turned up KEA, which looks to be very much what you want.

You can implement a naive solution with the following algorithm:

  • generate a list of n-grams from the keyword list, up to the phrase length that you want (choose an arbitrary phrase length limit, like 3 or 4)
  • put each n-gram into a Multiset
  • iterate over the entries of the multiset in descending order of their count, perhaps with an arbitrary cutoff
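
As a rough sketch of those steps in plain Java: a `HashMap` stands in for the Multiset here (Guava's `HashMultiset` would do the same job), and the class name `NgramCounter`, its method names, and the `maxLen` / `minCount` parameters are made up for illustration.

```java
import java.util.*;

final class NgramCounter {
    // Count every n-gram of 1..maxLen consecutive words across all queries.
    static Map<String, Integer> count(List<String> queries, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        for (String query : queries) {
            String cleaned = query.toLowerCase().trim();
            if (cleaned.isEmpty()) continue; // skip blank lines
            String[] words = cleaned.split("\\s+");
            for (int start = 0; start < words.length; start++) {
                for (int len = 1; len <= maxLen && start + len <= words.length; len++) {
                    String ngram = String.join(" ",
                            Arrays.copyOfRange(words, start, start + len));
                    counts.merge(ngram, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Keep n-grams seen at least minCount times, most frequent first
    // (ties broken alphabetically so the order is deterministic).
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int minCount) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) entries.add(e);
        }
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return entries;
    }
}
```

On the six example queries from the question with `maxLen = 3`, `java` and `training` each get a count of 4 and `online training` gets 2. Raising `minCount` trims the long tail of one-off phrases.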

As you said, this approach has a problem with stopwords. The simple fix is a dictionary of stopwords; alternatively, Term Frequency-Inverse Document Frequency (TF-IDF) weighting can help you recognize very frequent terms automatically. KEA will do this for you, so it might be best to look into that first.
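
To make the two options concrete, here is a minimal Java sketch. Everything in it is an assumption for illustration: the class name `TermWeights`, the tiny hard-coded stopword list, and the smoothed IDF formula `ln((N + 1) / (df + 1))`, which is one common variant rather than the only definition. Note that IDF only separates "for" from "java" if `docs` is some general reference collection; computed over the keyword list itself, your core keywords would look just as common as the stopwords.

```java
import java.util.*;

final class TermWeights {
    // Illustrative sample only; a real list would be much longer.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "for", "in", "of", "the", "to"));

    // Option 1: drop stopwords from a query using a fixed dictionary.
    static List<String> removeStopwords(String query) {
        List<String> kept = new ArrayList<>();
        for (String word : query.toLowerCase().trim().split("\\s+")) {
            if (!word.isEmpty() && !STOPWORDS.contains(word)) kept.add(word);
        }
        return kept;
    }

    // Option 2: smoothed inverse document frequency of a lowercase term.
    // Terms that occur in every document score 0; rarer terms score higher.
    static double idf(String term, List<String> docs) {
        int df = 0; // number of documents containing the term
        for (String doc : docs) {
            if (Arrays.asList(doc.toLowerCase().trim().split("\\s+")).contains(term)) df++;
        }
        return Math.log((docs.size() + 1.0) / (df + 1.0));
    }
}
```

Either filter would run before the n-gram counting step, so that phrases like "training for java" and "training java" end up contributing to the same counts.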

Hope that helps!

Upvotes: 3
