Reputation: 551
My question : How to train a classifier with only positive and neutral data?
I am building a personalized article recommendation system for education purposes. The data I use is from Instapaper.
Datasets
I only have positive data:
- Articles that I have read and "liked", regardless of read/unread status
And neutral data (because I have expressed interest in them, but I may not like them later anyway):
- Articles that are unread
- Articles that I have read and marked as read, but did not "like"
The data I do not have is negative data:
- Articles that I did not send to Instapaper to read later (I am not interested, even though I browsed that page/article)
- Articles that I might not even have clicked into, and may or may not have archived
My problem
In such a problem, negative data is basically missing. I have considered the following solutions, but have not settled on any of them yet:
1) Feed a number of negative examples to the classifier
Pros: Immediate negative data to teach the classifier
Cons: As the number of articles I like increases, the effect of the negative data on the classifier fades out
2) Turn the "neutral" data into negative data
Pros: Now I have all the positive and (new) negative data I need
Cons: Although the neutral data is only of mild interest to me, I would still like to get recommendations for such articles, perhaps as a lower-value class
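One way to keep the "lower-value class" idea from option 2 without collapsing neutral into negative would be a three-class model whose probabilities are combined with an assumed per-class value. This is only a sketch: the features, labels, and value weights below are illustrative stand-ins, not anything from my actual Instapaper data.

```python
# Sketch: rank articles by expected value over three classes
# (0 = negative, 1 = neutral, 2 = liked). All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))          # stand-in article feature vectors
y = rng.integers(0, 3, size=300)       # stand-in labels for the 3 classes

clf = LogisticRegression(max_iter=1000).fit(X, y)

value = np.array([0.0, 0.5, 1.0])      # assumed worth of each class
# Expected value per article = class probabilities weighted by value
scores = clf.predict_proba(X[:5]) @ value
```

Neutral articles then still surface in the ranking, just below articles the model considers "liked".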
Upvotes: 26
Views: 11422
Reputation: 75
As explained here, you could use LibSVM, specifically its one-class SVM option.
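A minimal sketch via scikit-learn's `OneClassSVM`, which wraps LIBSVM's one-class mode. The feature vectors here are random stand-ins for real article features:

```python
# Sketch: one-class SVM trained only on "liked" article vectors.
import numpy as np
from sklearn.svm import OneClassSVM  # scikit-learn's wrapper around LIBSVM one-class

rng = np.random.default_rng(0)
liked = rng.normal(loc=1.0, size=(100, 5))  # positive examples only

# New articles to score: 5 similar to the liked set, 5 very different
new_articles = np.vstack([
    rng.normal(loc=1.0, size=(5, 5)),
    rng.normal(loc=-3.0, size=(5, 5)),
])

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma="scale").fit(liked)
pred = clf.predict(new_articles)  # +1 = resembles the liked set, -1 = outlier
```

Articles predicted `-1` would be filtered out of recommendations; no negative examples are needed at training time.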
Hope it helps!
Upvotes: 2
Reputation: 5540
This is obviously an old post, but I have a similar problem, and hopefully you can save some time with the information I found on the following techniques:
Upvotes: 2
Reputation: 685
If you have a lot of positive feedback by different users, you have a rather typical collaborative filtering scenario.
Here are some CF solutions:
There exist publicly available implementations of those algorithms, e.g.
By the way, if you use a classifier for such problems, have a look at the literature on positive-only learning, e.g. http://users.csc.tntech.edu/~weberle/Fall2008/CSC6910/Papers/posonly.pdf
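As an illustration of one common CF approach (item-item similarity over implicit positive feedback; not necessarily one of the specific algorithms listed above), here is a toy sketch with a hand-made user-article matrix:

```python
# Sketch: item-item collaborative filtering on implicit positive feedback.
import numpy as np

# Rows = users, columns = articles; 1 = the user liked/saved that article.
R = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)

# Cosine similarity between article columns
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)

# Score unseen articles for user 0 by similarity to their liked articles
user = R[0]
scores = sim @ user
scores[user > 0] = -np.inf        # mask articles the user already has
recommended = int(np.argmax(scores))
```

In this toy matrix, user 0 gets article 2 recommended because users with overlapping tastes saved it; no negative feedback is required anywhere.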
Upvotes: 10
Reputation: 64
If you want to move away from a machine-learning approach: TF-IDF can give you a weighted, positive-only recommendation of articles similar to the ones you have liked (or viewed), and it is very common for this use case.
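A minimal sketch of the TF-IDF approach with scikit-learn, using toy documents in place of real articles:

```python
# Sketch: rank candidate articles by TF-IDF cosine similarity to liked articles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

liked = [
    "deep learning for text classification",
    "neural networks learn word representations",
]
candidates = [
    "a tutorial on neural text models",
    "best banana bread recipe",
]

vec = TfidfVectorizer()
X = vec.fit_transform(liked + candidates)
liked_X, cand_X = X[:len(liked)], X[len(liked):]

# Score each candidate by its best similarity to any liked article
scores = cosine_similarity(cand_X, liked_X).max(axis=1)
ranking = sorted(zip(scores, candidates), reverse=True)
```

The candidate sharing vocabulary with the liked articles ranks above the unrelated one, with no training step at all.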
More complex non-learning methods include LSA for determining document similarity, but it is non-trivial to implement, and constructing the LSA 'space' does not scale beyond hundreds or thousands of documents without a huge amount of processing power.
Both of these lie in the field of Computational Linguistics.
Good luck!
Upvotes: 0
Reputation: 1339
Make two binary classifiers.
1 -> "liked" or not
2 -> "neutral" or not
You also have the option to chain them together to avoid having a case where something is both "liked" and "neutral". This will allow you to classify content.
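A minimal sketch of the chained setup, assuming scikit-learn; the features and labels below are random stand-ins for real article data:

```python
# Sketch: two chained binary classifiers ("liked" vs rest, "neutral" vs rest).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                        # stand-in features
liked = (X[:, 0] > 0.5).astype(int)                  # stand-in "liked" labels
neutral = ((X[:, 0] <= 0.5) & (X[:, 1] > 0)).astype(int)  # stand-in "neutral"

clf_liked = LogisticRegression().fit(X, liked)
clf_neutral = LogisticRegression().fit(X, neutral)

def classify(x):
    # Chaining: check "liked" first, so nothing is both "liked" and "neutral"
    if clf_liked.predict(x.reshape(1, -1))[0] == 1:
        return "liked"
    if clf_neutral.predict(x.reshape(1, -1))[0] == 1:
        return "neutral"
    return "other"
```

Because the second classifier is only consulted when the first says "not liked", the two labels can never conflict.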
As the answer by @ThierryS indicates, another option is to build a recommender that suggests content that other, similar users have marked as "liked" or "neutral", thus taking advantage of the social aspect.
Upvotes: 0
Reputation: 9300
The Spy EM algorithm solves exactly this problem.
S-EM is a text learning or classification system that learns from a set of positive and unlabeled examples (no negative examples). It is based on a "spy" technique, naive Bayes and EM algorithm.
The basic idea is to mix your positive set with a large pool of random documents, some of which you hold out as "spies". You initially treat all the random documents as the negative class and learn a naive Bayes classifier on that set. Some of those random documents will actually be positive, so you can conservatively relabel any document that scores higher than the lowest-scoring held-out true positive. You then iterate this process until it stabilizes.
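A rough sketch of a single spy pass, using scikit-learn's `GaussianNB` on synthetic feature vectors as a stand-in for the multinomial text model (full S-EM repeats this relabel-and-retrain loop with EM until it converges):

```python
# Sketch of the "spy" step of S-EM on synthetic data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, size=(100, 3))             # known positives
unlabeled = np.vstack([
    rng.normal(loc=1.0, size=(20, 3)),               # hidden positives
    rng.normal(loc=-1.0, size=(80, 3)),              # true negatives
])

# Plant 15% of the positives as "spies" inside the unlabeled pool
n_spy = 15
spies, pos_rest = pos[:n_spy], pos[n_spy:]
X = np.vstack([pos_rest, unlabeled, spies])
y = np.array([1] * len(pos_rest) + [0] * (len(unlabeled) + n_spy))

nb = GaussianNB().fit(X, y)                          # unlabeled treated as negative
spy_scores = nb.predict_proba(spies)[:, 1]
threshold = spy_scores.min()                         # most conservative spy score

# Unlabeled documents scoring at least as high as the worst spy are
# conservatively relabeled as positive; S-EM would now retrain and iterate.
likely_pos = nb.predict_proba(unlabeled)[:, 1] >= threshold
```

The spies calibrate the threshold: since they are known positives hidden in the "negative" pool, anything the model ranks above the weakest spy is plausibly positive too.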
Upvotes: 18
Reputation: 977
What you are trying to do is more a recommender system than a classifier, I think.
A standard approach is to use the content of each article to create a bag of words. From there you can compute distances between articles. Articles with high similarity (using clustering, or similarity measures such as Pearson or Tanimoto) will be the ones you are most likely to want to read. This is the easiest way to get something working quickly.
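As a concrete example, the Tanimoto (Jaccard) similarity mentioned above can be sketched over bag-of-words sets like this:

```python
# Sketch: Tanimoto (Jaccard) similarity between two articles' word sets:
# |intersection| / |union| of their vocabularies.
def tanimoto(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0
```

Articles with a score near 1 share most of their vocabulary with something you liked; near 0, almost none.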
There are of course more sophisticated and accurate methods.
Upvotes: 0