Reputation: 11
Currently I'm learning about logistic regression and LDA (Linear Discriminant Analysis) classification, and I'm generating data in different ways to see how the two classifiers behave.
Here is a visualization of the two-dimensional predictors, with class plotted as color:
Here is my code:
library(ggplot2)
library(MASS)
library(pROC)   # provides roc() and auc(), used below

set.seed(1)
# Class A: tight Gaussian cluster at the origin
a <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = matrix(c(0.4, 0, 0, 0.4), nrow = 2, ncol = 2))
# Class B: much wider Gaussian, also centred at the origin
b <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = matrix(c(10, 0, 0, 10), nrow = 2, ncol = 2))
# Keep only the points of b outside radius 4, so that b1 is separated from a
b1 <- b[sqrt(b[, 1]^2 + b[, 2]^2) > 4, ]
df <- as.data.frame(rbind(a, b1))
names(df) <- c('x', 'y')
labelA <- rep('A', nrow(a))
labelB <- rep('B', nrow(b1))
# Add the label column to the data frame
df$labs <- c(labelA, labelB)
ggplot(df, aes(x = x, y = y, col = labs)) + geom_point()

# Binomial GLM with a probit link (use binomial('logit') for standard logistic regression)
prd <- glm(as.factor(labs) ~ x + y, family = binomial('probit'), data = df)
prd_score <- predict(prd, type = 'response')
plot(roc(df$labs, prd_score))
auc(roc(df$labs, prd_score))
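For the LDA side of the comparison (not fitted in the code above), a minimal sketch using MASS::lda on the same df, scored the same way as the GLM; the names lda_fit and lda_score are mine:

# Minimal LDA sketch (lda() comes from MASS, already loaded)
lda_fit <- lda(as.factor(labs) ~ x + y, data = df)
lda_score <- predict(lda_fit)$posterior[, 'B']  # posterior probability of class B
plot(roc(df$labs, lda_score))
auc(roc(df$labs, lda_score))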
Upvotes: 1
Views: 1213
Reputation: 7950
There is no mistake in your code.
Your dataset is a typical example of classes that cannot be separated by a linear combination of the features: logistic regression and LDA both produce a decision boundary that is linear in (x, y), while the true boundary here is a circle around the origin. That is why your ROC curve looks "weird"; it is entirely expected and simply tells you that a linear model cannot separate this data.
You need to investigate non-linear classification techniques. Given the radial distribution of the data, I can imagine that a support vector machine (SVM) with a radial basis kernel could do the trick:
library(e1071)
library(pROC)

# SVM regression needs a numeric label
labelA <- rep(0, nrow(a))
labelB <- rep(1, nrow(b1))
df$labsNum <- c(labelA, labelB)

# Fit an epsilon-regression SVM with a radial basis kernel
svm_prd <- svm(labsNum ~ x + y, data = df, kernel = "radial", type = "eps-regression")
svm_score <- predict(svm_prd)
plot(roc(df$labs, svm_score))   # score with the SVM predictions, not the earlier GLM ones
auc(roc(df$labs, svm_score))
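As a complementary check (my addition, not part of the original answer): because the separation is purely radial, a plain GLM becomes sufficient once you hand it the squared radius as a feature. A minimal sketch, with r2, prd_r, and r_score as made-up names:

# Engineer the squared radius, then a model linear in the features suffices
df$r2 <- df$x^2 + df$y^2
prd_r <- glm(as.factor(labs) ~ r2, family = binomial('logit'), data = df)
# The classes are almost perfectly separated in r2, so glm() may warn that
# fitted probabilities of 0 or 1 occurred; the AUC should be close to 1
r_score <- predict(prd_r, type = 'response')
plot(roc(df$labs, r_score))
auc(roc(df$labs, r_score))

This illustrates the same point as the radial kernel: the boundary only needs to be linear in a suitably transformed feature space.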
Upvotes: 2