Tag based clustering algorithm

Question

I am looking to cluster many feeds based on their tags. A typical example would be Twitter feeds. Each feed has user-defined tags associated with it. By analyzing the tags, is it possible to cluster the feeds into different groups and tell how many feeds are associated with which tags? An example would be:

Feed1 - Earthquake in Indonesia #earthquake #asia #bad
Feed2 - There is a large earthquake in my area #earthquake #bad
Feed3 - My parents went to Singapore #asia #tour
Feed4 - XYZ company is laying off many people #XYZ #layoff #bear
Feed5 - XYZ is doing badly and is planning layoffs #XYZ #layoff #bad
Feed6 - XYZ is on a layoff spree #layoff #XYZ #worst

After clustering

#asia, #earthquake - Feed1, Feed2
#XYZ, #layoff - Feed4, Feed5, Feed6

Here the clustering is based purely on the tags. Is there a good algorithm to achieve this?

Pulkit Goyal · Accepted Answer

If I understand your question correctly, you would like to cluster the tags together and then put the feeds into these clusters based on the tags in the feed.

For this, you could create a similarity measure between the tags based on the number of feeds in which the tags appear together. For your example, this would be something like this:

              | #earthquake | #asia | #bad | ...
#earthquake   |      1      |  1/2  | 2/2
#asia         |     1/2     |   1   | 1/2
#bad          |     2/3     |  1/3  |  1
...

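Here, the value at (i,j) equals the number of feeds that contain both tag i and tag j divided by the number of feeds that contain tag i. Note that this makes the matrix asymmetric: (#earthquake, #bad) is 2/2, while (#bad, #earthquake) is 2/3.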
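To make the definition concrete, here is a minimal sketch that builds this matrix from the six example feeds. The `feeds` dictionary and the `tag_similarity` function name are my own choices for illustration, and exact fractions are used only so the values match the table.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

# The six example feeds from the question, reduced to their tag sets.
feeds = {
    "Feed1": {"#earthquake", "#asia", "#bad"},
    "Feed2": {"#earthquake", "#bad"},
    "Feed3": {"#asia", "#tour"},
    "Feed4": {"#XYZ", "#layoff", "#bear"},
    "Feed5": {"#XYZ", "#layoff", "#bad"},
    "Feed6": {"#layoff", "#XYZ", "#worst"},
}

def tag_similarity(feeds):
    """Return {(i, j): feeds containing both i and j / feeds containing i}."""
    tag_count = Counter()   # number of feeds containing each tag
    pair_count = Counter()  # number of feeds containing each ordered tag pair
    for tags in feeds.values():
        tag_count.update(tags)
        pair_count.update(permutations(tags, 2))
    sim = {}
    for i in tag_count:
        for j in tag_count:
            together = tag_count[i] if i == j else pair_count[(i, j)]
            sim[(i, j)] = Fraction(together, tag_count[i])
    return sim

sim = tag_similarity(feeds)
print(sim[("#earthquake", "#bad")])  # 1   (i.e. 2/2)
print(sim[("#bad", "#earthquake")])  # 2/3
```

For a large number of tags you would probably want a sparse representation that stores only the pairs that actually co-occur, rather than the full matrix built here.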

Now you have a similarity matrix between the tags, and you could use virtually any clustering algorithm that suits your needs. Since the number of tags can be very large and estimating the number of clusters before running the algorithm is difficult, I would suggest a hierarchical clustering algorithm such as Fast Modularity clustering, which is also very fast (see some details here). However, if you have some estimate of the number of clusters that you would like to break this into, then spectral clustering might be useful too (see some details here).
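As a rough illustration of the hierarchical idea, here is a sketch that uses plain average-linkage agglomerative clustering. This is a simpler stand-in, not Fast Modularity or spectral clustering. It assumes the asymmetric values are made symmetric by averaging the two directions, and the stopping `threshold` of 0.5 is an arbitrary value picked for this example.

```python
from collections import Counter
from itertools import combinations

feeds = {
    "Feed1": {"#earthquake", "#asia", "#bad"},
    "Feed2": {"#earthquake", "#bad"},
    "Feed3": {"#asia", "#tour"},
    "Feed4": {"#XYZ", "#layoff", "#bear"},
    "Feed5": {"#XYZ", "#layoff", "#bad"},
    "Feed6": {"#layoff", "#XYZ", "#worst"},
}

def symmetric_similarity(feeds):
    """Average of count(i, j)/count(i) and count(i, j)/count(j) per tag pair."""
    tag_count, pair_count = Counter(), Counter()
    for tags in feeds.values():
        tag_count.update(tags)
        pair_count.update(frozenset(p) for p in combinations(sorted(tags), 2))
    sim = {}
    for pair, together in pair_count.items():
        a, b = tuple(pair)
        sim[pair] = (together / tag_count[a] + together / tag_count[b]) / 2
    return sorted(tag_count), sim

def average_linkage(c1, c2, sim):
    """Mean pairwise similarity between two clusters (0 if tags never co-occur)."""
    total = sum(sim.get(frozenset((a, b)), 0.0) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def cluster_tags(feeds, threshold=0.5):
    """Repeatedly merge the two most similar clusters until none reach threshold."""
    tags, sim = symmetric_similarity(feeds)
    clusters = [frozenset([t]) for t in tags]
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda p: average_linkage(p[0], p[1], sim))
        if average_linkage(c1, c2, sim) < threshold:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters

clusters = cluster_tags(feeds)
for cluster in clusters:
    print(sorted(cluster))
```

On the example data this puts #XYZ and #layoff in one cluster and keeps them apart from #earthquake. The exact grouping of the rarer tags depends on the threshold and on how ties are broken. The loop is cubic in the number of tags, so it is only meant for small inputs.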

After you cluster the tags together, you could use a simple approach to assign each feed to a cluster: for example, count the number of tags from each cluster that appear in a feed and assign the feed to the cluster with the maximum number of matching tags.
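A minimal sketch of that assignment step could look like the following. The two tag clusters are taken from the expected output in the question, and the cluster labels and the `assign_feed` helper are hypothetical names. Ties go to the first cluster encountered, and a feed with no matching tags is left unassigned (`None`); both are choices made for this sketch.

```python
feeds = {
    "Feed1": {"#earthquake", "#asia", "#bad"},
    "Feed2": {"#earthquake", "#bad"},
    "Feed3": {"#asia", "#tour"},
    "Feed4": {"#XYZ", "#layoff", "#bear"},
    "Feed5": {"#XYZ", "#layoff", "#bad"},
    "Feed6": {"#layoff", "#XYZ", "#worst"},
}

# Tag clusters as they appear in the question's expected output.
tag_clusters = {
    "asia/earthquake": {"#asia", "#earthquake"},
    "XYZ/layoff": {"#XYZ", "#layoff"},
}

def assign_feed(tags, tag_clusters):
    """Return the label of the cluster sharing the most tags with the feed."""
    best_label, best_overlap = None, 0
    for label, cluster in tag_clusters.items():
        overlap = len(tags & cluster)
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label

assignment = {name: assign_feed(tags, tag_clusters) for name, tags in feeds.items()}
for name, label in sorted(assignment.items()):
    print(name, "->", label)
```

Note that with this rule Feed3 also lands in the asia/earthquake cluster because of #asia, which differs slightly from the grouping shown in the question.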

If you are flexible on your clustering strategy, you could also try clustering the feeds directly in a similar way: create a similarity measure between the feeds based on the number of tags they have in common, and then apply a clustering algorithm to that similarity matrix.
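For the feed-to-feed variant, one possible similarity is the Jaccard index (number of shared tags divided by the number of distinct tags across both feeds). The normalization is my assumption; the raw count of common tags would work as well. A small sketch:

```python
from itertools import combinations

feeds = {
    "Feed1": {"#earthquake", "#asia", "#bad"},
    "Feed2": {"#earthquake", "#bad"},
    "Feed3": {"#asia", "#tour"},
    "Feed4": {"#XYZ", "#layoff", "#bear"},
    "Feed5": {"#XYZ", "#layoff", "#bad"},
    "Feed6": {"#layoff", "#XYZ", "#worst"},
}

def jaccard(a, b):
    """Shared tags divided by all distinct tags of the two feeds."""
    return len(a & b) / len(a | b)

# Symmetric, so each unordered pair of feeds is stored once.
feed_sim = {
    (f, g): jaccard(feeds[f], feeds[g])
    for f, g in combinations(sorted(feeds), 2)
}

for (f, g), value in feed_sim.items():
    if value > 0:
        print(f, g, round(value, 2))
```

The resulting `feed_sim` values can then be fed to whichever clustering algorithm you pick, in the same way as the tag similarity matrix.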
