tshauck
tshauck

Reputation: 21554

Grouping Similar Strings

I'm trying to analyze a bunch of search terms, so many that individually they don't tell much. That said, I'd like to group the terms because I think similar terms should have similar effectiveness. For example,

Term               Group
NBA Basketball     1
Basketball NBA     1
Basketball         1
Baseball           2

It's a contrived example, but hopefully it explains what I'm trying to do. So then, what is the best way to do what I've described? I thought the nltk may have something along those lines, but I'm only barely familiar with it.

Thanks

Upvotes: 4

Views: 3284

Answers (1)

John Lehmann
John Lehmann

Reputation: 8225

You'll want to cluster these terms, and for the similarity metric I recommend Dice's Coefficient at the character-gram level. For example, partition the strings into two-letter sequences to compare (term1="NB", "BA", "A ", " B", "Ba"...).

nltk appears to provide dice as nltk.metrics.association.BigramAssocMeasures.dice(), but it's simple enough to implement in a way that'll allow tuning. Here's how to compare these strings at the character rather than word level.

import sys, operator

def tokenize(s, glen):
  g2 = set()
  for i in xrange(len(s)-(glen-1)):
    g2.add(s[i:i+glen])
  return g2

def dice_grams(g1, g2): return (2.0*len(g1 & g2)) / (len(g1)+len(g2))

def dice(n, s1, s2): return dice_grams(tokenize(s1, n), tokenize(s2, n))

def main():
  GRAM_LEN = 4
  scores = {}
  for i in xrange(1,len(sys.argv)):
    for j in xrange(i+1, len(sys.argv)):
      s1 = sys.argv[i]
      s2 = sys.argv[j]
      score = dice(GRAM_LEN, s1, s2)
      scores[s1+":"+s2] = score
  for item in sorted(scores.iteritems(), key=operator.itemgetter(1)):
    print item

When this program is run with your strings, the following similarity scores are produced:

./dice.py "NBA Basketball" "Basketball NBA" "Basketball" "Baseball"

('NBA Basketball:Baseball', 0.125)
('Basketball NBA:Baseball', 0.125)
('Basketball:Baseball', 0.16666666666666666)
('NBA Basketball:Basketball NBA', 0.63636363636363635)
('NBA Basketball:Basketball', 0.77777777777777779)
('Basketball NBA:Basketball', 0.77777777777777779)

At least for this example, the margin between the basketball and baseball terms should be sufficient for clustering them into separate groups. Alternatively you may be able to use the similarity scores more directly in your code with a threshold.

Upvotes: 9

Related Questions