Vineeth Mohan
Vineeth Mohan

Reputation: 19283

Tag based clustering algorithm

I am looking to cluster many feeds based on their tags. A typical example would be twitter feeds. Each feed will have user defined tags associated with it. By analyzing the tags , is it possible to cluster the feeds into different groups and tell so much feeds are based on so much tags. An example would be -

After clustering

Here clustering is found purely on basis of tags. Is there any good algorithm to achieve this

Upvotes: 8

Views: 4366

Answers (2)

Pulkit Goyal
Pulkit Goyal

Reputation: 5664

If I understand your question correctly, you would like to cluster the tags together and then put the feeds into these clusters based on the tags in the feed.

For this, you could create a similarity measure between the tags based on the number of feeds that the tags appear in together. For your example, this would be something like this

               #earthquake | #asia | #bad | ...
#earthquake        1       |  1/2  |  2/2
#asia             1/2      |   1   |  1/2
#bad              2/3      |  1/3  |   1
...

Here, value at (i,j) equals frequency of (i,j)/frequency of (i).

Now you have a similarity matrix between the tags and you could virtually any clustering algorithm that suits your needs. Since, the number of tags can be very large and estimating the number of clusters is difficult before running the algorithm, I would suggest using some heirarchical clustering algorithm like Fast Modularity clustering which is also very fast (See some details here). However, if you have some estimate of the number of clusters that you would like to break this into, then Spectral clustering might be useful too (See some details here).

After you cluster the tags together, you could use a simple approach to assign each feed to a cluster. This can be very simple, for example, counting the number of tags from each cluster in a feed and assigning a cluster with the maximum number of matching tags.

If you are flexible on your clustering strategy, then you could also try clustering the feeds together in a similar way by creating a similarity between the feeds based on the number of common tags between the feeds and then applying a clustering algorithm on the similarity matrix.

Upvotes: 7

Eric Galluzzo
Eric Galluzzo

Reputation: 3241

Interesting question. I'm making things up here, but I think this would work.

Algorithm

For each feed, come up with a complete list of tag combinations (of length >= 2), probably sorted for consistency. For example:

  • Feed1: (asia-bad), (asia-earthquake), (bad-earthquake), (asia-bad-earthquake)
  • Feed2: (bad-earthquake)
  • Feed3: (asia-tour)
  • Feed4: (bear-layoff), (bear-XYZ), (layoff-XYZ), (bear-layoff-XYZ)
  • Feed5: (bad-layoff), (bad-XYZ), (layoff-XYZ), (bad-layoff-XYZ)
  • Feed6: (layoff-worst), (layoff-XYZ), (worst-XYZ), (layoff-worst-XYZ)

Then reverse the mapping:

  • (asia-bad): Feed1
  • (asia-earthquake): Feed1
  • (bad-earthquake): Feed1, Feed2
  • (asia-bad-earthquake): Feed1
  • (asia-tour): Feed3
  • (bear-layoff): Feed4
  • ...
  • (layoff-XYZ): Feed4, Feed5, Feed6
  • ...

You can then cull all the entries with a frequency higher than some threshold. In this case, if we take a frequency threshold of 2, then you'd get (bad-earthquake) with Feed1 and Feed2, and (layoff-XYZ) with Feed4, Feed5 and Feed6.

Performance Concerns

A naive implementation of this would have extremely poor performance -- exponential in the number of tags per feed (not to mention space requirements). However, there are various ways to apply heuristics to improve this. For example:

  1. Determine the most popular X tags by scanning all feeds (or a random selection of X feeds) -- this is linear in the number of tags per feed. Then only consider the Y most popular tags for each feed.
  2. Determine the frequency of all (or most) tags. Then, for each post, only consider the X most popular tags in that post. This prevents situations where you have, say, fifteen tags for some post, resulting in a huge list of combinations, most of which would never occur.
  3. For each post, only consider combinations of length <= X. For example, if a feed had fifteen tags, you could end up with a huge number of combinations, but most of them would have very few occurrences, especially the long ones. So only consider combinations of two or three tags.
  4. Only scan a random selection of X feeds.

Hope this helps!

Upvotes: 2

Related Questions