Reputation: 182
I've been given a YouTube trending dataset with the assignment to build a predictive model that outputs the probability of a video trending, with at least 60% accuracy.
I have the title, channel, thumbnail_link, views, likes, dislikes, comments, date, ...
I've done some analysis and, as it turns out, the important columns are `category` and `tags` (a "|"-separated list).
The problem is that every video in the dataset is assumed to have trended, so I can't fit a classifier on training data to predict a trending yes/no column, and I can't use a regression algorithm without changing the goal to something like "predict how liked it will be".
So it sounds like what I'm looking for is a clustering algorithm. I've looked into k-means, but as far as I can tell it won't do the trick.
I'm thinking that I could compare videos pairwise by which categories and tags they contain and score them by the popularity of those, or write a distance function that calculates similarity, but the implication is that I should be using scikit-learn.
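The tag-similarity idea above can be sketched roughly as follows. This is a minimal illustration, not a full solution: the column name `tags` and the sample data are assumptions about the dataset, and Jaccard similarity is just one reasonable choice of similarity function.

```python
# Sketch of the similarity-scoring idea: score a new video against the
# trending set by the Jaccard similarity of its tag set to each trending
# video's tag set. The "tags" column name and data are hypothetical.
import pandas as pd

def tag_set(tag_string):
    """Split the '|'-separated tags column into a normalized set of tags."""
    return set(t.strip().lower() for t in tag_string.split("|") if t.strip())

def jaccard(a, b):
    """Jaccard similarity of two sets (0 = disjoint, 1 = identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def trending_score(new_tags, trending_df):
    """Mean similarity of a new video's tags to every trending video's tags."""
    sets = trending_df["tags"].map(tag_set)
    new_set = tag_set(new_tags)
    return sets.map(lambda s: jaccard(new_set, s)).mean()

trending = pd.DataFrame({"tags": ["music|pop|live", "gaming|live|stream"]})
print(trending_score("pop|music|concert", trending))  # 0.25
```

A score like this is cheap and interpretable, but it treats all tags as equally informative; weighting tags by how often they occur in trending videos (as suggested above) would be a natural next step.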
Upvotes: 1
Views: 96
Reputation: 1097
This sounds like a one-class classification problem. Some options are:
- fit a representative distribution to the data, then for a new observation (video) check how likely it is to have come from that distribution
- fit a classifier that essentially finds the boundary of the data, then for a new observation tells you how far inside/outside the boundary it is, e.g. `sklearn.svm.OneClassSVM`
- fit cluster centers, or find archetypal examples, and then for a new observation tell how far it is from the nearest cluster center compared to an average observation in the training data
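To make the second option concrete, here is a minimal `OneClassSVM` sketch. The feature matrix is synthetic stand-in data, assumed to represent whatever numeric features you extract from the trending videos (e.g. encoded categories and tag counts):

```python
# One-class classification sketch with scikit-learn's OneClassSVM.
# X_trending stands in for numeric features extracted from trending
# videos; the features themselves are assumptions for illustration.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Training data: features of videos that DID trend (the only class we have).
X_trending = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu bounds the fraction of training points allowed outside the boundary.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_trending)

X_new = np.array([[0.1, -0.2],   # close to the training mass
                  [6.0, 6.0]])   # far outside it

# decision_function: positive = inside the learned boundary (trending-like),
# negative = outside. predict maps this to +1 / -1.
print(clf.decision_function(X_new))
print(clf.predict(X_new))
```

Note that `decision_function` gives a signed distance, not a probability; to report "probability of trending" as the assignment asks, you would need to calibrate or rescale that score yourself.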
Just some ideas, there are certainly other approaches. :)
Upvotes: 1