Algorithm for recognizing similar data?

Question

I've been given a youtube trending dataset with the assignment to make a predictive model which outputs the probability of a video getting into trending with at least 60% accuracy.

I have the title, channel, thumbnail_link, views, likes, dislikes, comments, date, ...

I've done some analyses and go figure the important columns are

category, tags(a "|" separated list)

The problem is that it's assumed all videos have trended so I can't use a classifier and fit it with training data to predict a trending yes/no column or use a regression algorithm without changing the goal to "predict how liked will it be" or something.

So it sounds like what I'm looking for is a clustering alg, I've looked into KMeans but as far as I can tell it won't do the trick

I'm thinking that I could compare video by video which categories and tags it contains and score it by the popularity of them or make a distance calculating similarity function but the implication is that I should use scikit

stevemo · Accepted Answer

This sounds like a one-class classification problem. Some options are:

fit a representative distribution of the data, then for a new observation (video) check how likely it is to have come from that distribution
fit a classifier that will essentially find the boundaries of the data, then for a new observation tell you how far in/out-side of the boundary it is, for example scikit-learn.svm.OneClassSVM
fit cluster centers, or find archetypal examples, and then for a new observation tell how far it is from the cluster center compared to an average observation in the training data

Just some ideas, there are certainly other approaches. :)

Algorithm for recognizing similar data?

Answers (1)

Related Questions