Reputation: 41
I have several algorithms which solve a binary classification (with response 0 or 1) problem by assigning to each observation a probability of the target value being equal to 1. All the algorithms try to minimize the log loss function where N is the number of observations, y_i is the actual target value and p_i is the probability of 1 predicted by the algorithm. Here is some R code with sample data:
actual.response = c(1,0,0,0,1)
prediction.df = data.frame(
method1 = c(0.5080349,0.5155535,0.5338271,0.4434838,0.5002529),
method2 = c(0.5229466,0.5298336,0.5360780,0.4217748,0.4998602),
method3 = c(0.5175378,0.5157711,0.5133765,0.4372109,0.5215695),
method4 = c(0.5155535,0.5094510,0.5201827,0.4351625,0.5069823)
)
log.loss = colSums(-1/length(actual.response)*(actual.response*log(prediction.df)+(1-actual.response)*log(1-prediction.df)))
The sample code gives the log loss for each algorithm:
method1 method3 method2 method4
0.6887705 0.6659796 0.6824404 0.6719181
Now I want to combine this algorithms so I can minimize the log loss even further. Is there any R package which can do this for me? I will appreciate references to any algorithms, articles, books or research papers which solve this kind of problem. Note that as a final result I want to have the predicted probabilities of each class and note plain 0,1 responses.
Upvotes: 2
Views: 733
Reputation: 2489
This is called ensemble learning (Wikipedia).
Check out this article: "an intro to ensemble learning in r."
Here is an example I did using the Cornell movie review data which can be downloaded by clicking the link. I used to data set with 1000 positive and 1000 negative reviews. Once you get the data into R:
library(RTextTools)
library(tm)
library(glmnet)
library(ipred)
library(randomForest)
library(data.table)
## create a column of sentiment score. 0 for negative and 1 for
## positive.
text_neg$pos_neg<-rep(0,1000)
text_pos$pos_neg<-rep(1,1000)
## Combine into 1 data.table and rename.
text_all<-rbind(text_neg, text_pos)
##dont forget to shuffle
set.seed(26)
text2<-text_all[sample(nrow(text_all)),]
## turn the data.frame into a document term matrix. This uses the handy
##RTextTools wrappers and functions.
doc_matrix <- create_matrix(text2$V1, language="english",
removeNumbers=TRUE, stemWords=TRUE, removeSparseTerms=.98)
ncol(data.frame(as.matrix(doc_matrix)))
## 2200 variables at .98 sparsity. runs pretty slow...
## create a container with the very nice RTextTools package
container <- create_container(doc_matrix, text2$pos_neg,
trainSize=1:1700, testSize=1701:2000, virgin=FALSE)
## train the data
time_glm<-system.time(GLMNET <- train_model(container,"GLMNET"));
time_glm #1.19
time_slda<-system.time(SLDA <- train_model(container,"SLDA"));
time_slda #45.03
time_bag<-system.time(BAGGING <- train_model(container,"BAGGING"));
time_bag #59.24
time_rf<-system.time(RF <- train_model(container,"RF")); time_rf #69.59
## classify with the models
GLMNET_CLASSIFY <- classify_model(container, GLMNET)
SLDA_CLASSIFY <- classify_model(container, SLDA)
BAGGING_CLASSIFY <- classify_model(container, BAGGING)
RF_CLASSIFY <- classify_model(container, RF)
## summarize results
analytics <- create_analytics(container,cbind( SLDA_CLASSIFY,
BAGGING_CLASSIFY,RF_CLASSIFY, GLMNET_CLASSIFY))
summary(analytics)
This ran an ensemble classifier using the 4 different methods (random forests, GLM, SLD and bagging). The ensemble summary at the end shows
# ENSEMBLE SUMMARY
#
# n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
# n >= 1 1.00 0.86
# n >= 2 1.00 0.86
# n >= 3 0.89 0.89
# n >= 4 0.63 0.96
That if all 4 methods agreed on if the review was positive or negative, then the ensemble had a 96% recall rate. But be careful, because with a binary outcome (2 choices) and 4 different algorithms, there is bound to be a lot of agreement.
See the RTextTools
documentation for more explanation. They also do an almost identical example with U.S Congress data that I more or less mimicked in the above example.
Hope this was helpful.
Upvotes: 3