Working with a dataset in sklearn

Question

My dataset is a .csv file in this format:

id,interaction_flag,x_coordinate,y_coordinate,z_coordinate,hydrophobicity_kd,hydrophobicity_ww,hydrophobicity_hh,surface_tension,charge_cooh,charge_nh3,charge_r,alpha_helix,beta_strand,turn,van_der_walls,mol_wt,solublity  
229810,1,-33.8675148907451,-110.273691995647,100.021824089754,0.129381338742408,0.129381338742408,0.129381338742408,57.9996957403639,2.20539553752535,9.55985801217038,4.47146044624688,1.08064908722114,1.20135902636915,0.611653144016251,145.232251521298,107.951643002026,21.5344036511141        
229811,1,-26.9070290467923,-117.172163712053,106.980243932766,0.922048681541592,0.922048681541592,0.922048681541592,58.5383367139972,2.03983772819472,9.23210953346856,1.58401622717997,0.84178498985806,1.0387626774848,0.921703853955354,124.73630831643,84.1570182555755,10.7648600405665

I am trying to plot a Receiver Operating Characteristic (ROC) curve from this data, following this example: http://scikit-learn.org/0.11/auto_examples/plot_roc.html

My target is the interaction_flag column, and the features are all the columns after interaction_flag. However, my program keeps running and never finishes.

When I run the test example given in that link, it runs within a moment.

Can anyone tell me what I am doing wrong? Or do I need to do something else to load my data the way iris is loaded?

My code:

import numpy as np
import pylab as pl
from sklearn import svm, datasets
from sklearn.utils import shuffle
from sklearn.metrics import roc_curve, auc

training = 'dataset/training_5000_col.csv'
test = 'dataset/test_5000_col.csv'

random_state = np.random.RandomState(0)

# Import some data to play with
#iris = datasets.load_iris()
#X = iris.data
#y = iris.target
X = []
y = []
for line in open(training):
    z = line.rstrip().split(',')
    y.append(int(z[2]))
    tmp = []
    for a in range(5, 15):
        tmp.append(float(z[a]))
    X.append(tmp)
X_train = np.array(X)
y_train = np.array(y)



X1 = []
y1 = []
for line in open(test):
    z = line.rstrip().split(',')
    y1.append(int(z[2]))
    tmp = []
    for a in range(5, 15):
        tmp.append(float(z[a]))
    X1.append(tmp)
X_test = np.array(X1)
y_test = np.array(y1)

# Run classifier
classifier = svm.SVC(kernel='linear', probability=True)
probas_ = classifier.fit(X_train, y_train).predict_proba(X_test)

# Compute ROC curve and area under the curve
fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
print "y_test : ", y_test
print "fpr : ", fpr
print "tpr : ", tpr
roc_auc = auc(fpr, tpr)
print "Area under the ROC curve : %f" % roc_auc

# Plot ROC curve
pl.clf()
pl.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
pl.plot([0, 1], [0, 1], 'k--')
pl.xlim([0.0, 1.0])
pl.ylim([0.0, 1.0])
pl.xlabel('False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('Receiver operating characteristic example')
pl.legend(loc="lower right")
pl.show()

My .csv file is at http://pastebin.com/iet5xQW2. How can I plot the ROC curve from this .csv?

Answers (1)
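
A few things in the posted code do not match the sample data and are worth checking first. With the header shown in the question, interaction_flag is column index 1 and the features start at index 2, while the code reads z[2] as the label and z[5] to z[14] as the features. The header row would also fail the int() and float() conversions unless it is skipped. As for the run that never finishes, one plausible cause (an assumption, not confirmed against the real files) is that the features are on very different scales, which tends to make SVC training slow. On top of that, probability=True makes fit slower because it runs an internal cross-validation.

The sketch below shows one way to put this together, assuming a recent scikit-learn on Python 3. The helpers load_xy and make_fake_csv are hypothetical names, and the data is synthetic so that the block runs on its own. It standardizes the features before fitting and uses decision_function scores for the ROC curve instead of predict_proba, which is a swapped-in choice and not what the linked example does.

```python
import io

import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def load_xy(source):
    """Read a CSV with one header row: column 1 is the label, columns 2+ are features."""
    data = np.loadtxt(source, delimiter=",", skiprows=1)
    return data[:, 2:], data[:, 1].astype(int)


def make_fake_csv(n_rows, seed):
    """Build an in-memory CSV (id, flag, 4 features) as a stand-in for the real files."""
    rng = np.random.RandomState(seed)
    # Features on deliberately different scales, loosely mimicking the real columns.
    features = rng.randn(n_rows, 4) * [100.0, 1.0, 50.0, 10.0]
    flag = (features[:, 0] / 100.0 + features[:, 1] > 0).astype(int)
    lines = ["id,interaction_flag,f1,f2,f3,f4"]
    for i in range(n_rows):
        fields = [str(i), str(flag[i])] + [repr(float(v)) for v in features[i]]
        lines.append(",".join(fields))
    return io.StringIO("\n".join(lines))


X_train, y_train = load_xy(make_fake_csv(200, seed=0))
X_test, y_test = load_xy(make_fake_csv(100, seed=1))

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
classifier = SVC(kernel="linear")
classifier.fit(scaler.transform(X_train), y_train)

# Signed distances to the separating hyperplane serve as ranking scores for the ROC curve.
scores = classifier.decision_function(scaler.transform(X_test))
fpr, tpr, thresholds = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)
print("Area under the ROC curve: %f" % roc_auc)
```

To try this on the real files, pass the paths from the question (for example 'dataset/training_5000_col.csv') to load_xy in place of the synthetic buffers; np.loadtxt accepts a filename as well as a file-like object. The fpr and tpr arrays can then go into the same plotting code as in the question.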