Reputation: 1087
I have newspaper articles' corpus by day. Each word in the corpus has a frequency count of being present that day. I have been toying with finding an algorithm that captures the break-away words, similar to the way Twitter measures Trends in people's tweets.
For Instance, say the word 'recession' appears with the following frequency in the same group of newspapers:
Day 1 | recession | 456
Day 2 | recession | 2134
Day 3 | recession | 3678
While 'europe'
Day 1 | europe | 67895
Day 2 | europe | 71999
Day 3 | europe | 73321
I was thinking of taking the % growth per day and multiplying it by the log of the sum of frequencies. Then I would take the average to score and compare various words.
In this case:
recession = (3.68*8.74+0.72*8.74)/2 = 19.23
europe = (0.06*12.27+0.02*12.27)/2 = 0.49
Is there a better way to capture the explosive growth? I'm trying to mine the daily corpus to find terms that are more and more mentioned in a specific time period across time. PLEASE let me know if there is a better algorithm. I want to be able to find words with high non-constant acceleration. Maybe taking the second derivative would be more effective. Or maybe I'm making this way too complex and watched too much physics programming on the discovery channel. Let me know with a math example if possible Thanks!
Upvotes: 15
Views: 1398
Reputation: 36456
First thing to notice is that this can be approximated by a local problem. That is to say, a "trending" word really depends only upon recent data. So immediately we can truncate our data to the most recent N
days where N
is some experimentally determined optimal value. This significantly cuts down on the amount of data we have to look at.
In fact, the NPR article suggests this.
Then you need to somehow look at growth. And this is precisely what the derivative captures. First thing to do is normalize the data. Divide all your data points by the value of the first data point. This makes it so that the large growth of an infrequent word isn't drowned out by the relatively small growth of a popular word.
For the first derivative, do something like this:
d[i] = (data[i] - data[i+k])/k
for some experimentally determined value of k
(which, in this case, is a number of days). Similarly, the second derivative can be expressed as:
d2[i] = (data[i] - 2*data[i+k] + data[i+2k])/(2k)
Higher derivatives can also be expressed like this. Then you need to assign some kind of weighting system for these derivatives. This is a purely experimental procedure which really depends on what you want to consider "trending." For example, you might want to give acceleration of growth half as much weight as the velocity. Another thing to note is that you should try your best to remove noise from your data because derivatives are very sensitive to noise. You do this by carefully choosing your value for k
as well as discarding words with very low frequencies altogether.
I also notice that you multiply by the log sum of the frequencies. I presume this is to give the growth of popular words more weight (because more popular words are less likely to trend in the first place). The standard way of measuring how popular a word is is by looking at it's inverse document frequency (IDF).
I would divide by the IDF of a word to give the growth of more popular words more weight.
IDF[word] = log(D/(df[word))
where D
is the total number of documents (e.g. for Twitter it would be the total number of tweets) and df[word]
is the number of documents containing word
(e.g. the number of tweets containing a word).
A high IDF corresponds to an unpopular word whereas a low IDF corresponds to a popular word.
Upvotes: 8
Reputation: 4877
The problem with your approach (measuring daily growth in percentage) is that it disregards the usual "background level" of the word, as your example shows; 'europe' grows more quickly than 'recession', yet is has a much lower score.
If the background level of words has a well-behaved distribution (Gaussian, or something else that doesn't wander too far from the mean) then I think a modification of CanSpice's suggestion would be a good idea. Work out the mean and standard deviation for each word, using days C-N+1-T
to C-T
, where C is the current date, N is the number of days to take into account, and T is the number of days that define a trend.
Say for instance N=90 and T=3, so we use about three months for the background, and say a trend is defined by three peaks in a row. In that case, for example, you can rank the words according to their chi-squared p-value, calculated like so:
(mu, sigma) = fitGaussian(word='europe', startday=C-N+1-3, endday=C-3)
X1 = count(word='europe', day=C-2)
X2 = count(word='europe', day=C-1)
X3 = count(word='europe', day=C)
S = ((X1-mu)/sigma)^2 + ((X2-mu)/sigma)^2 + ((X3-mu)/sigma)^2
p = pval.chisq(S, df=3)
Essentially then, you can get the words which over the last three days are the most extreme compared to their background level.
Upvotes: 1
Reputation: 10658
I would first try a simple solution. A simple weighted difference between adjacent day should probably work. Maybe taking the log before that. You might have to experiment with the weights. For examle (-2,-1,1,2) would give you points where the data is exploding.
If this is not enough, you can try slope filtering ( http://www.claysturner.com/dsp/fir_regression.pdf ). Since the algorithm is based on linear regression, it should be possible to modify it for other types of regression (for example quadratic).
All attempts using filtering techniques such as these also have the advantage, that they can be made to run very fast and you should be able to find libraries that provide fast filtering.
Upvotes: 0