Jack Twain

Reputation: 6372

Detecting trends in a data stream in real-time

I'm trying to detect trending topics on Twitter in real time. Every time I receive a tweet, I assign it to the cluster that covers the same topic. Regardless of which clustering algorithm I use or how I assign tweets to topics, I can't work out how to detect a trending topic.

My working definition of a trending cluster/topic is one that is assigned more tweets than the other clusters during a certain period of time. Put differently, its size is updated more frequently than the size of the other clusters.

What I can't work out is how to turn that definition into actual code or a mathematical model.

This is an example of how the size of a trending cluster develops over time: (image: cluster size over time)

As you can see, the cluster size starts at zero and then suddenly begins to increase, because the topic is now hot and tweets are being assigned to the cluster. Once the topic is no longer hot, the cluster size stays relatively static.

Upvotes: 0

Views: 1259

Answers (2)

Azar

Reputation: 1

This may be too late for you given the date of the original question, but maybe not. Given the time since your question, you probably have enough data to know what the shape of the trend line looks like. All you have to do now is normalize it to a fixed range, for example from -1 to 1, and compare it against your real-time data, normalized the same way. If the difference between the real data and the reference shape is below a threshold, the trend is in progress and you can sound the alarm.
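A minimal sketch of that comparison, assuming the reference shape and the live data are both lists of cluster sizes sampled at the same regular intervals. The function names, the -1 to 1 scaling, and the mean-absolute-difference measure are illustrative choices, not something the approach prescribes:

```python
def normalize(values):
    """Rescale a sequence linearly so its minimum is -1 and its maximum is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat series has no shape to rescale; map it to all zeros.
        return [0.0] * len(values)
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]


def matches_template(recent, template, threshold):
    """Return True if the normalized recent data is close to the normalized template.

    Closeness is the mean absolute difference between the two normalized
    sequences (an assumed measure; others would work as well).
    """
    if len(recent) != len(template):
        raise ValueError("recent and template must have the same length")
    a = normalize(recent)
    b = normalize(template)
    mean_abs_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return mean_abs_diff < threshold


# Hypothetical reference shape: flat at first, then a sharp rise.
template = [0, 0, 1, 4, 9, 10]
print(matches_template([0, 0, 10, 40, 90, 100], template, 0.1))
print(matches_template([5, 5, 5, 5, 5, 5], template, 0.1))
```

Note that normalizing discards the absolute scale, so a cluster that grows from 0 to 10 tweets has the same shape as one that grows from 0 to 10,000. In practice you would probably combine this with a minimum size or rate.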


Upvotes: 0

MvG

Reputation: 60858

It seems you are trying to detect situations where the curve of the graph you sketched has a slope above a certain threshold. But you don't have a continuous curve. Instead you have sample points, one for every assignment of a tweet to a cluster. Two such sample points would in theory define a slope, but these slopes would be very noisy: two tweets in close succession on an otherwise boring topic would suddenly make it trend. To avoid this, you'll have to smooth your data somehow. One possible way is a sliding window, spanning either a fixed amount of time (e.g. two hours) or a fixed number of tweets. So you could formulate a threshold like

  • it's trending if the number of tweets in the last x minutes is at least y or
  • it's trending if the yth-to-last tweet was no more than x minutes ago

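Phrased as a threshold like this, the two formulations above are actually equivalent. If you had to measure trendiness with a single number instead of a yes/no answer, the two would differ: the first gives you a tweet count for a fixed time span, the second a time span for a fixed tweet count.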
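As a rough sketch, the second formulation can be implemented with a bounded queue of timestamps per cluster. The class name, parameter names, and the use of seconds as the time unit are assumptions made for illustration. Timestamps are assumed to arrive in non-decreasing order:

```python
from collections import deque


class SlidingWindowTrend:
    """Track the timestamps of the most recent tweets assigned to one cluster."""

    def __init__(self, window_seconds, min_tweets):
        self.window_seconds = window_seconds  # x, expressed in seconds
        self.min_tweets = min_tweets          # y
        # Only the last y timestamps matter, so older ones are dropped automatically.
        self.timestamps = deque(maxlen=min_tweets)

    def add(self, timestamp):
        """Record one tweet assigned to this cluster."""
        self.timestamps.append(timestamp)

    def is_trending(self, now):
        """True if the y-th-to-last tweet is no more than x seconds old."""
        if len(self.timestamps) < self.min_tweets:
            return False
        return now - self.timestamps[0] <= self.window_seconds


def count_in_window(timestamps, now, window_seconds):
    """First formulation: count the tweets that fall inside the last x seconds."""
    return sum(1 for t in timestamps if now - t <= window_seconds)


# Three tweets within ten minutes make the cluster trend.
trend = SlidingWindowTrend(window_seconds=600, min_tweets=3)
for ts in (0, 100, 200):
    trend.add(ts)
print(trend.is_trending(now=200))
print(count_in_window([0, 100, 200], now=200, window_seconds=600) >= 3)
```

Because the queue keeps only the last y timestamps, memory per cluster stays constant no matter how many tweets arrive. You would keep one such object per cluster and check it whenever a tweet is assigned.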

If this simple approach doesn't work well for you, you might want to ask at Cross Validated and you might also investigate various peak detection algorithms, since this is essentially the problem of finding a peak in the slope function.

Upvotes: 3
