SpectralClustering

Reputation: 11

Weird SVM behavior in R (e1071)

I ran the following code for a binary classification task with an SVM in both R (first example) and Python (second example).

Given randomly generated data (X) and a response (Y), this code performs leave-group-out cross-validation 1000 times. Each entry of the resulting prediction vector (ans) is therefore the mean of the predictions across the CV iterations in which that sample was held out.

Computing the area under the curve (AUC) should give ~0.5, since X and Y are completely random. However, this is not what we see: the AUC is frequently significantly higher than 0.5. The number of rows of X is very small, which can obviously cause problems.

Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mitigate the problem, but I am looking for other issues.

Y=as.factor(rep(c(1,2), times=14))
X=matrix(runif(length(Y)*100), nrow=length(Y))

library(e1071)
library(pROC)

colnames(X)=1:ncol(X)
iter=1000
ansMat=matrix(NA,length(Y),iter)
for(i in seq(iter)){
    #get train: a random half of the samples; skip splits that miss a class
    train=sample(seq(length(Y)),0.5*length(Y))
    if(min(table(Y[train]))==0)
        next

    #test: the remaining samples
    test=seq(length(Y))[-train]

    #train model
    XX=X[train,]
    YY=Y[train]
    mod=svm(XX,YY,probability=FALSE)

    #predict on the test set
    XXX=X[test,]
    predVec=predict(mod,XXX)
    #NULL here unless decision.values=TRUE is passed to predict(); unused below
    RFans=attr(predVec,'decision.values')
    ansMat[test,i]=as.numeric(predVec)
}

ans=rowMeans(ansMat,na.rm=TRUE)

r=roc(Y,ans)$auc
print(r)

When I implement the same thing in Python, I get similar results.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    if len(np.unique(Y[train])) == 1:
        continue
    test = np.array([j for j in range(len(Y)) if j not in train])
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))

Upvotes: 1

Views: 1522

Answers (2)

Elena Doe

Reputation: 1

There are a couple of things to be mentioned here. First and foremost, as vvjn already pointed out, the AUC should be computed within each CV loop. It should also be computed from predicted probabilities rather than binary labels. Currently, you predict binary labels and then take the mean of the finite values (np.nanmean), which is not the same as using the predicted probabilities in each CV fold. You would need to change

mod = SVC(probability=False)
mod.fit(X=X[train, :], y=Y[train])
# predict and collect answer
ansMat[test, i] = mod.predict(X[test, :])

to

mod = SVC(probability=True)
mod.fit(X=X[train, :], y=Y[train])
pred_score = mod.predict_proba(X[test, :])[:, 1]
fpr_tmp, tpr_tmp, _ = roc_curve(Y[test], pred_score)

then compute the fold's AUC (e.g. with auc(fpr_tmp, tpr_tmp)), append it to a list, and finally compute the mean over all folds. This will yield average AUCs much closer to 0.5. Note that this only works out of the box if you code your labels as 0 and 1, which is more common in Python, instead of 1 and 2 (otherwise you need to pass pos_label explicitly to roc_curve).
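A rough sketch of the full per-fold loop, assuming labels recoded as 0/1 and using roc_auc_score as a shorthand for roc_curve followed by auc, could look like this:

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

Y = np.array([0, 1] * 14)                      # labels recoded as 0/1
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000

aucs = []
for i in range(n_iter):
    # random half/half split; skip splits that miss a class in either half
    train = np.random.choice(len(Y), size=len(Y) // 2, replace=False)
    test = np.setdiff1d(np.arange(len(Y)), train)
    if len(np.unique(Y[train])) < 2 or len(np.unique(Y[test])) < 2:
        continue
    mod = SVC(probability=True).fit(X[train, :], Y[train])
    # per-fold AUC from the predicted probability of the positive class
    pred_score = mod.predict_proba(X[test, :])[:, 1]
    aucs.append(roc_auc_score(Y[test], pred_score))

print(np.mean(aucs))   # hovers around 0.5 for purely random data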

Minor things: Instead of

# Get train/test index
train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), 
                         replace=False, p=None)
if len(np.unique(Y[train])) == 1:
    continue

consider using a stratified train/test split, as available from scikit-learn (a minimal sketch follows below), so that each fold contains the same number of samples per class (i.e., to avoid strong class imbalance). Moreover, the number of input features (in your case 100) can influence your model's performance. It is generally recommended to keep the number of input features low, e.g., lower than the number of cases (in your case 14 for training and 14 for testing).
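A minimal sketch of such a stratified split with scikit-learn's StratifiedShuffleSplit, assuming 0/1 labels and the 50/50 split size from the original code:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

Y = np.array([0, 1] * 14)
X = np.random.uniform(size=[len(Y), 100])

# 1000 random 50/50 splits, each keeping the same number of samples per class
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.5)
for train, test in sss.split(X, Y):
    # fit on X[train, :], Y[train] and score on X[test, :], Y[test] as above
    pass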

Upvotes: 0

vvjn

Reputation: 106

You should consider each iteration of cross-validation to be an independent experiment, where you train using the training set, test using the testing set, and then calculate the model skill score (in this case, AUC).

So what you should actually do is calculate the AUC for each CV iteration and then take the mean of those AUCs.
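As an illustration (not part of this answer itself), scikit-learn's cross_val_score performs exactly this per-iteration scoring when given an AUC scorer, so a sketch of the whole experiment under that assumption could be:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedShuffleSplit

Y = np.array([0, 1] * 14)
X = np.random.uniform(size=[len(Y), 100])

# every split is scored independently (AUC via the SVM's decision function),
# and the mean of those per-split AUCs is the final estimate
cv = StratifiedShuffleSplit(n_splits=1000, test_size=0.5)
scores = cross_val_score(SVC(), X, Y, cv=cv, scoring='roc_auc')
print(scores.mean())   # ~0.5 for purely random data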

Upvotes: 1
