Doc

Reputation: 3

svm-train output file has fewer lines than the input file

I am currently building a binary classification model and have created an input file for svm-train (svm_input.txt). This input file has 453 lines, 4 features and 2 classes [0, 1], e.g.

0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39
......................

My problem is that the output model generated by svm-train (svm_train_model.txt) has fewer data lines than the input file. The line count of the model file is 450, but the first 9 of those lines are a header showing the various parameters generated, i.e.

svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV

Therefore the model holds only 441 data lines (450 - 9), so 12 lines in total from the original input of 453 have gone. I am new to SVMs and was hoping that someone could shed some light on why this might have happened. Thanks in advance.
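The arithmetic above can be checked directly against the model file. The sketch below is a minimal, hypothetical helper (`summarize_model` is not part of libsvm) that assumes the header ends at the line `SV` and that every non-blank line after it is one stored row. Note that the header itself reports `total_sv 441`, which matches 450 - 9, so the rows after `SV` appear to be the model's support vectors rather than a copy of the input file. The demo text is a made-up two-row model used only for illustration.

```python
def summarize_model(lines):
    """Return (header_line_count, row_count_after_SV, declared_total_sv)."""
    header = []
    rows = []
    in_sv_section = False
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if in_sv_section:
            rows.append(line)
        else:
            header.append(line)
            if line == "SV":  # assumption: "SV" marks the end of the header
                in_sv_section = True

    # Look up the total_sv value the header declares, if present
    declared = None
    for line in header:
        parts = line.split()
        if parts[0] == "total_sv":
            declared = int(parts[1])
    return len(header), len(rows), declared


# Illustrative model text (not the real svm_train_model.txt)
demo_model = """svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 2
rho -0.156449
label 0 1
nr_sv 1 1
SV
1 1:15 2:40 3:30 4:15
-1 1:22.73 2:40.91 3:36.36 4:0
"""

header_count, row_count, declared_sv = summarize_model(demo_model.splitlines())
print(header_count, row_count, declared_sv)
```

For the real file, the lines can be passed as `open('svm_train_model.txt')` instead of the demo text; if the row count equals the declared `total_sv`, the line count is accounted for by the header plus the stored support vectors.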

Update:

I now believe that, in generating the model, svm-train has removed lines whose labels and feature values are all exactly the same.

To explain: my input is a set of miRNAs which have been classified as 1 or 0 depending on whether or not they are involved in a particular process (i.e. 1 = Yes, 0 = No). The input file looks something like:

0 1:22 2:30 3:14 4:16

1 1:26 2:15 3:17 4:25

0 1:22 2:30 3:14 4:16

Here, lines one and three are exactly the same and, as a result, appear to be removed from the output model. My question is both why the output model would do this and how I can get around it (whilst using the same features).

Whilst SOME OF the labels and their corresponding feature values are identical within the input file, these are still different miRNAs.

NOTE: The input file does not have a feature for miRNA name, which would clearly show the differences between the lines. In terms of the features used (i.e. nucleotide percentage content), however, some of the miRNAs have exactly the same percentage content of A, U, G and C. They therefore seem to be viewed as duplicates and removed from the output model even though they are not duplicates, hence the fewer lines in the output model.
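One way to test the duplicate hypothesis is to count how many input rows repeat an earlier row and compare that number with the 12 missing lines. The sketch below is only an illustration: `count_duplicate_rows` is a hypothetical helper, and it assumes two rows are duplicates when they match exactly after whitespace is normalised.

```python
from collections import Counter


def count_duplicate_rows(lines):
    """Return (rows that occur more than once, number of surplus copies)."""
    # Normalise whitespace so that spacing differences do not hide duplicates
    rows = [" ".join(line.split()) for line in lines if line.strip()]
    counts = Counter(rows)
    duplicates = {row: n for row, n in counts.items() if n > 1}
    surplus = sum(n - 1 for n in duplicates.values())
    return duplicates, surplus


# The three sample rows from the question
sample = [
    "0 1:22 2:30 3:14 4:16",
    "1 1:26 2:15 3:17 4:25",
    "0 1:22 2:30 3:14 4:16",
]

duplicates, surplus = count_duplicate_rows(sample)
print(duplicates, surplus)
```

Running this over svm_input.txt would show whether the number of surplus copies is actually 12; if it is not, duplicate rows alone would not explain the difference in line counts.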

The columns of the input file are:

Column 0 - label (i.e 1 or 0): 1=Yes & 0=No

Column 1 - Feature 1 = Percentage Content "A"

Column 2 - Feature 2 = Percentage Content "U"

Column 3 - Feature 3 = Percentage Content "G"

Column 4 - Feature 4 = Percentage Content "C"

The input file actually looks something like the following. The very first two lines appear identical, but each line represents a different miRNA:

1 1:23 2:36 3:23 4:18

1 1:23 2:36 3:23 4:18

0 1:36 2:32 3:5 4:27

1 1:14 2:41 3:36 4:9

1 1:18 2:50 3:18 4:14

0 1:36 2:23 3:23 4:18

0 1:15 2:40 3:30 4:15
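A quick way to confirm that rows follow the column layout described above is to parse each one into a label and a feature dictionary. `parse_libsvm_line` is a hypothetical helper written for this illustration; it assumes every feature is written as `index:value` and that the labels are integers.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    parts = line.split()
    label = int(parts[0])  # assumption: integer class labels (0 or 1)
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


# Two of the sample rows from the question
label, features = parse_libsvm_line("1 1:23 2:36 3:23 4:18")
print(label, features)

# The four percentages of a row should add up to roughly 100
other_label, other_features = parse_libsvm_line("0 1:36 2:32 3:5 4:27")
print(sum(other_features.values()))
```

Parsing the rows this way also makes it easy to attach an identifier such as the miRNA name outside the feature file, so that identical feature rows can still be told apart.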

In terms of software, I am using libsvm-3.22 and Python 2.7.5.

Upvotes: 0

Views: 138

Answers (1)

papaya

Reputation: 1535

My first observation is that you should align your input file properly. The libsvm code doesn't look for exactly 4 features. It identifies them by the string literals you have provided that separate the features from the labels. I suggest manually converting your input file to create the desired input argument. Try running the following Python code. Requirements: h5py, if your input comes from MATLAB (.mat file):

pip install h5py

import h5py
import numpy as np

# Read the training labels from the label .mat file
f = h5py.File('traininglabel.mat', 'r')  # give label.mat file for training
labels = None
for name, data in f.items():
    labels = data[()][0]  # first row of the last dataset in the file
f.close()

# Convert labels such as 0.0 / 1.0 to the strings '0' / '1'
trainlabels = []
for value in labels:
    text = str(value)
    if text == '0.0':
        text = '0'
    if text == '1.0':
        text = '1'
    trainlabels.append(text)
    print(text)
trainlabels = np.array(trainlabels)

# Read the feature matrix from the features .mat file
f = h5py.File('training_features.mat', 'r')  # give features here
features = None
for name, data in f.items():
    features = data[()]  # last dataset in the file
f.close()

# Write one line per sample: the label followed by its feature values
with open('traindata.txt', 'w') as out:
    for i in range(0, 1000):  # no. of training samples in the features file
        out.write(str(trainlabels[i]))
        out.write(' ')
        for j in range(0, 49):  # no. of feature values written per sample
            out.write(str(features[j][i]))
            out.write(' ')
        out.write('\n')
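Note that svm-train expects each feature to be written as `index:value`, as in the question's sample rows (for example `0 1:22 2:30 3:14 4:16`). A minimal sketch of formatting one row that way, assuming dense features numbered from 1; `format_libsvm_row` is a hypothetical helper, not part of libsvm or h5py:

```python
def format_libsvm_row(label, values):
    """Format a label and a dense list of feature values as one libsvm row."""
    # Feature indices start at 1 in the libsvm text format
    pairs = ["%d:%s" % (index, value)
             for index, value in enumerate(values, start=1)]
    return " ".join([str(label)] + pairs)


row = format_libsvm_row(0, [22, 30, 14, 16])
print(row)
```

In the conversion script, the inner loop could build each line with a helper like this instead of writing the bare values.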

Upvotes: 0
