Reputation:
Services like fflick, mombo, etc. analyse tweets about movies. They seem to process hundreds of thousands of tweets.
How do they match a tweet with a movie? For example, let's say there is a movie called "Unknown". How do they determine whether a tweet is about "Unknown" the movie or about something else that is unknown?
How are they able to collect so many tweets? Do they use the Streaming API?
Do they maintain a list of movie names and check each tweet against it to find out whether the tweet references a particular movie?
Upvotes: 1
Views: 202
Reputation: 51226
The following are just my guesses.
Certainly a list of movie names is required. That's a necessary first step in trimming the tweets down to a subset that could possibly refer to a movie.
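As a rough illustration of that first filtering step, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the title list, the function names, and the choice of a case-insensitive whole-phrase regex. I have no idea how fflick or similar services actually do it.

```python
import re

# Hypothetical title list; a real service would presumably load this
# from a movie database rather than hard-code it.
MOVIE_TITLES = ["Unknown", "Terminator 2", "Gone With the Wind"]

def build_title_pattern(titles):
    """Compile one regex that matches any title as a whole phrase."""
    # Longest titles first, so a longer title wins over a shorter one
    # that starts at the same position.
    ordered = sorted(titles, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    # The lookarounds stop "unknown" from matching inside "unknowns".
    return re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)", re.IGNORECASE)

def candidate_titles(tweet, pattern, canonical):
    """Return the set of titles that the tweet could be referring to."""
    return {canonical[m.group(1).lower()] for m in pattern.finditer(tweet)}

PATTERN = build_title_pattern(MOVIE_TITLES)
CANONICAL = {t.lower(): t for t in MOVIE_TITLES}
```

Note that `candidate_titles("The cause is still unknown", PATTERN, CANONICAL)` still returns the title "Unknown". The filter only narrows the stream down to tweets that could refer to a movie; it does nothing to disambiguate them.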
A movie title is either recognisable from the words themselves (e.g. "Terminator 2") or it requires the author to disambiguate it (e.g. "Unknown" -- or "Gone With the Wind", which could be referring to either the movie or the book). In the latter case, a variety of clues will be provided. Perhaps most obviously:
I expect that criteria like the above are used to assign probabilities for classification according to some weights, and that the usual techniques have been applied to tweak the weights to give good predictions. I would expect a supervised machine learning approach: essentially, have some humans classify a few hundred tweets, then optimise the weights for performance on some subset of this dataset, and finally test how well the chosen weights work for classifying the remainder of the dataset (this is to check that overfitting has not occurred).
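To make the idea of learned weights concrete, here is a minimal sketch using plain-Python logistic regression. It is only one of many ways to do this, and every specific in it is an assumption: the clue words, the made-up labelled tweets, the learning rate, and the 8/2 split are invented for illustration and say nothing about what a real service uses.

```python
import math

# Hypothetical clue words; the criteria a real service uses are not known.
CLUE_WORDS = ["movie", "film", "watched", "cinema", "trailer"]

def features(tweet):
    words = set(tweet.lower().split())
    # The leading 1.0 is the bias term; the rest are binary clue indicators.
    return [1.0] + [1.0 if w in words else 0.0 for w in CLUE_WORDS]

def predict(weights, x):
    """Probability that the tweet is about the movie (logistic function)."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5):
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    weights = [0.0] * (len(CLUE_WORDS) + 1)
    for _ in range(epochs):
        for tweet, label in examples:
            x = features(tweet)
            error = label - predict(weights, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights

def accuracy(weights, examples):
    correct = sum((predict(weights, features(t)) >= 0.5) == bool(y)
                  for t, y in examples)
    return correct / len(examples)

# Made-up hand-labelled tweets (1 = about the movie "Unknown", 0 = not).
LABELLED = [
    ("watched unknown last night great movie", 1),
    ("the unknown trailer looks good", 1),
    ("unknown is the best film this year", 1),
    ("going to the cinema to see unknown", 1),
    ("call from an unknown number again", 0),
    ("cause of the outage is still unknown", 0),
    ("unknown error when i open the app", 0),
    ("fear of the unknown", 0),
    ("saw the unknown movie at the cinema", 1),
    ("unknown sender in my inbox", 0),
]

# Optimise on one part of the data, then check the rest for overfitting.
train_set, test_set = LABELLED[:8], LABELLED[8:]
weights = train(train_set)
```

Calling `accuracy(weights, test_set)` then gives the fraction of held-out tweets classified correctly. With a few hundred real labelled tweets instead of ten invented ones, a large gap between training and held-out accuracy would be the sign of overfitting mentioned above.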
Upvotes: 3