user187809

How do services like fflick work? What algorithms do they use?

Services like fflick, mombo, etc. analyse tweets about movies. They seem to process hundreds of thousands of tweets.

  1. How do they match a tweet with a movie? For example, let's say there is a movie called "Unknown". How do they determine whether a tweet is about Unknown the movie or about an unknown something else?

  2. How are they able to collect so many tweets? Streaming API?

  3. Do they maintain a list of movie names and check each tweet against this list, to find out if a tweet is referencing a particular movie?

Upvotes: 1

Views: 202

Answers (1)

j_random_hacker

Reputation: 51226

The following are just my guesses.

Certainly a list of movie names is required. That's a necessary first step in trimming the tweets down to a subset that could possibly refer to a movie.
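As a rough illustration of that first filtering step, here is a minimal sketch that checks each tweet against a title list. The list contents and the names `MOVIE_TITLES`, `build_title_pattern` and `candidate_titles` are made up for the example. A real service would presumably load titles from a movie database and handle misspellings, hashtags and abbreviations, none of which this does.

```python
import re

# Hypothetical title list; in practice this would come from a movie database.
MOVIE_TITLES = ["Unknown", "Terminator 2", "Gone With the Wind"]


def build_title_pattern(titles):
    """Compile one case-insensitive regex that matches any title as a whole phrase."""
    # Longest titles first, so a longer title wins over a shorter one it contains.
    ordered = sorted(titles, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # The \b anchors stop "unknown" from matching inside "unknowns".
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


TITLE_PATTERN = build_title_pattern(MOVIE_TITLES)
CANONICAL = {t.lower(): t for t in MOVIE_TITLES}


def candidate_titles(tweet):
    """Return titles that appear in the tweet. These are only candidates:
    a match on "unknown" says nothing yet about whether the movie is meant."""
    found = {CANONICAL[m.group(1).lower()] for m in TITLE_PATTERN.finditer(tweet)}
    return sorted(found)
```

Tweets with no candidate titles can be discarded straight away; the rest still need disambiguating.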

A movie title is either recognisable from the words themselves (e.g. "Terminator 2") or it requires the author to disambiguate it (e.g. "Unknown" -- or "Gone With the Wind", which could be referring to either the movie or the book). In the latter case, a variety of clues will be provided. Perhaps most obviously:

  • Anything that follows a phrase like "Just saw" or "Watched" is highly likely to be a movie name. Less so anything following "Read".
  • If the name of the director or an actor in the film is mentioned, it's likely to be referring to the movie.
  • Twitter content is heavily skewed towards the latest thing, so the probability that a movie is being discussed drops as the time since the movie hit the theatres increases.
  • If a tweet is in response to another tweet known with high probability to be referring to a particular movie, then it is probably about the same movie.
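To make the clues above concrete, here is a sketch that turns each one into a numeric feature and combines them into a probability with a logistic function. Everything specific in it is an assumption for illustration: the feature names, the hand-picked values in `WEIGHTS`, the 8-week recency decay, and the example tweets. It is a guess at the general shape, not a description of what any of these services actually do.

```python
import math
import re
from datetime import date

# Illustrative weights only; real values would have to be fitted to labelled data.
WEIGHTS = {
    "bias": -2.0,
    "watch_cue": 2.5,       # "just saw" / "watched" directly before the title
    "read_cue": -2.0,       # "read" before the title hints at a book instead
    "cast_mention": 2.0,    # director or an actor is named in the tweet
    "recency": 1.5,         # close to 1 at release, decays towards 0 afterwards
    "reply_to_movie": 2.0,  # parent tweet was already judged to be about the movie
}


def extract_features(tweet, title, cast, release_date, tweet_date, parent_is_movie=False):
    """Map one (tweet, candidate title) pair to the clue features listed above."""
    text = tweet.lower()
    t = re.escape(title.lower())
    weeks_since_release = max((tweet_date - release_date).days, 0) / 7.0
    return {
        "bias": 1.0,
        "watch_cue": 1.0 if re.search(r"\b(just saw|watched)\s+" + t + r"\b", text) else 0.0,
        "read_cue": 1.0 if re.search(r"\bread\s+" + t + r"\b", text) else 0.0,
        "cast_mention": 1.0 if any(name.lower() in text for name in cast) else 0.0,
        "recency": math.exp(-weeks_since_release / 8.0),  # assumed decay rate
        "reply_to_movie": 1.0 if parent_is_movie else 0.0,
    }


def movie_probability(features):
    """Weighted sum of the features squashed into (0, 1) by the logistic function."""
    z = sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Example inputs (illustrative data).
cast = ["Liam Neeson"]
released = date(2011, 2, 18)
tweeted = date(2011, 2, 20)

p_movie = movie_probability(extract_features(
    "Just saw Unknown with Liam Neeson", "Unknown", cast, released, tweeted))
p_other = movie_probability(extract_features(
    "The cause of the outage is unknown", "Unknown", cast, released, tweeted))
```

With these made-up weights the first tweet scores well above the second, because it has both a watch cue and a cast mention while the second has only recency in its favour.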

I expect that criteria like the above are used to assign probabilities for classification according to some weights, and that the usual techniques have been applied to tweak the weights to give good predictions. I would expect a supervised machine learning approach: essentially, have some humans classify a few hundred tweets, then optimise the weights for performance on some subset of this dataset, and finally test how well the chosen weights work for classifying the remainder of the dataset (this is to check that overfitting has not occurred).
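The train-then-check procedure could look something like the sketch below: logistic regression fitted by plain stochastic gradient descent, then scored on examples held back from training. The dataset is synthetic and stands in for a few hundred human-labelled tweets. The feature layout `[bias, watch_cue, cast_mention, read_cue]`, the labelling rule inside `make_synthetic_dataset`, the 70/30 split and the learning rate are all assumptions made up for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Probability that feature vector x describes a tweet about the movie."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train(examples, epochs=500, lr=0.5):
    """Fit logistic-regression weights with stochastic gradient descent."""
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            error = predict(weights, x) - y  # gradient of log loss w.r.t. z
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights


def accuracy(weights, examples):
    """Fraction of examples classified correctly at a 0.5 threshold."""
    hits = sum((predict(weights, x) >= 0.5) == bool(y) for x, y in examples)
    return hits / len(examples)


def make_synthetic_dataset(n, seed=0):
    """Stand-in for hand-labelled tweets: [bias, watch_cue, cast_mention, read_cue] -> label.
    The labelling rule is invented purely so the example has something to learn."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        watch, cast, read = (rng.randint(0, 1) for _ in range(3))
        label = 1 if watch + cast - read >= 1 else 0
        data.append(([1.0, float(watch), float(cast), float(read)], label))
    return data


data = make_synthetic_dataset(300)
split = int(0.7 * len(data))
train_set, test_set = data[:split], data[split:]  # held-out part is never trained on

weights = train(train_set)
train_acc = accuracy(weights, train_set)
test_acc = accuracy(weights, test_set)
```

If `test_acc` comes out much lower than `train_acc`, that suggests the weights have overfitted the training subset. On real tweets the gap would be far more informative than on this toy data, where the labels follow an exact rule.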

Upvotes: 3
