Brian

Reputation: 963

Computing Trending Topics

Let's say I'm collecting tweets from Twitter based on a variety of criteria and storing them in a local MySQL database. I want to be able to compute trending topics, like Twitter does, where a topic can be anywhere from 1-3 words in length.

Is it possible to write a script to do something like this in PHP and MySQL?

I've found answers on how to compute which terms are "hot" once you have counts of the terms, but I'm stuck at the first part: how should I store the data in the database, and how can I count the frequency of terms that are 1-3 words in length?
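One possible storage layout (table and column names here are illustrative, not prescribed by the question): keep the raw tweets in one table, extract each 1-3 word phrase at insert time into a second table, and let a `GROUP BY` do the counting.

```sql
-- Sketch of a possible schema; names are hypothetical
CREATE TABLE tweets (
    id         BIGINT UNSIGNED PRIMARY KEY,
    body       VARCHAR(280) NOT NULL,
    created_at DATETIME NOT NULL
);

CREATE TABLE tweet_phrases (
    tweet_id   BIGINT UNSIGNED NOT NULL,
    phrase     VARCHAR(100) NOT NULL,   -- a 1-3 word phrase extracted from the tweet
    INDEX (phrase)
);

-- Frequency of each phrase over, say, the last hour
SELECT p.phrase, COUNT(*) AS freq
FROM tweet_phrases p
JOIN tweets t ON t.id = p.tweet_id
WHERE t.created_at >= NOW() - INTERVAL 1 HOUR
GROUP BY p.phrase
ORDER BY freq DESC
LIMIT 20;
```

The time window in the `WHERE` clause is what turns raw frequency into "trending": rerun the query per window and compare counts across windows.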

Upvotes: 5

Views: 1568

Answers (4)

judotens

Reputation: 21

My recipe for trending topics:
1. fetch the tweets
2. split each tweet on spaces into n-grams (up to 3-grams if you want phrases three words long)
3. filter out URLs, @usernames, common words, and junk characters
4. count the frequency of each unique keyword/phrase
5. mute some junk words/phrases

Yes, you can do it in PHP & MySQL ;)
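The steps above can be sketched as follows (in Python rather than PHP, and with a deliberately tiny stopword list; a real filter list would be much larger):

```python
import re
from collections import Counter

# Minimal illustrative stopword list -- extend for real use
STOPWORDS = {"the", "a", "an", "i", "is", "to", "and", "of", "in"}

def ngrams(tokens, n):
    """Return all n-word phrases from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def trending_counts(tweets, max_n=3):
    """Count every 1..max_n word phrase across the tweets."""
    counts = Counter()
    for tweet in tweets:
        # step 3: strip URLs and @usernames before tokenizing
        cleaned = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
        tokens = [t for t in re.findall(r"[a-z0-9#']+", cleaned)
                  if t not in STOPWORDS]
        # step 4: tally unigrams, bigrams, and trigrams
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts
```

Calling `trending_counts(tweets).most_common(20)` then gives the top candidate phrases; step 5 (muting junk phrases) is just a final filter over that list.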

Upvotes: 2

user257111

Reputation:

Or do the opposite of Dominik and store a set list of phrases you wish to match, spaces and all. Write them as regex strings. For each row in the database (file, SQL table, whatever), run the regexes and tally the matches.
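A minimal sketch of that phrase-lookup approach (the phrase list here is hypothetical, and `rows` stands in for whatever your database query returns):

```python
import re
from collections import Counter

# Hypothetical set list of phrases to watch for
PHRASES = ["world cup", "breaking news", "machine learning"]

def count_phrases(rows, patterns=PHRASES):
    """Tally occurrences of each listed phrase across the rows of text."""
    compiled = [(p, re.compile(p, re.IGNORECASE)) for p in patterns]
    counts = Counter()
    for row in rows:                     # each row is one tweet's text
        for name, rx in compiled:
            counts[name] += len(rx.findall(row))
    return counts
```

Compiling each pattern once up front matters when you run the same regexes over every row in the table.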

It depends on which way round you want to do it: count everything minus what is common, thereby finding what is truly trending, or look up a set list of phrases. In the first case you'll find a lot that might not interest you and will need an extensive blocklist; in the other, you'll need a huge whitelist.

To go beyond that, you need natural language processing tools to determine the meaning of what is said.

Upvotes: 0

Artjom Kurapov

Reputation: 6155

What you need is either

  1. document classification, or..
  2. automatic tagging

Probably the second one. Only then can you count their popularity over time.

Upvotes: 1

Dominik

Reputation: 1202

How about decomposing your tweets into single-word tokens first and calculating the number of occurrences of each word? Once you have those counts, you could decompose into all two-word tokens, count their occurrences, and finally do the same with all three-word tokens.
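That incremental pass-per-length idea can be sketched like this (Python rather than PHP; the ignore list is illustrative):

```python
from collections import Counter

# Illustrative dictionary of words you don't want to count
IGNORE = {"the", "a", "an", "and", "is", "rt"}

def count_tokens(tweets, n):
    """Count all n-word tokens across the tweets, skipping ignored words."""
    counts = Counter()
    for tweet in tweets:
        words = [w for w in tweet.lower().split() if w not in IGNORE]
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

tweets = ["good morning world", "good morning everyone"]
one_word = count_tokens(tweets, 1)    # single-word occurrences
two_word = count_tokens(tweets, 2)    # then two-word tokens
three_word = count_tokens(tweets, 3)  # finally three-word tokens
```

Each pass reuses the same tokenization, so in practice you'd tokenize once and derive all three counts in a single loop.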

You might also want to add some kind of dictionary of words you don't want to count.

Upvotes: 1
