Reputation: 1347
I have data that looks like this:
line1 = '-0.9821 1:15 2:20 4:10 8:10'
line2 = '0.1235 1:15 2:20 6:10 10:10'
line3 = '0.2132 1:15 3:20 5:10 9:10'
line4 = '0.328 2:15 4:20 6:10 7:12 8:16 10:10'
line5 = '0.973 2:15 3:20 6:10 8:12 9:10'
The first entry in each line is the output (Y) variable. The remaining entries represent sparse vectors (e.g., '1:15' means that at index 1, the X value is 15).
I am trying to calculate a predicted Y based on kNN estimation. I'm new to sparse matrices. I found some documentation that says I can use sparse matrices to estimate kNN:
knn = neighbors.KNeighborsClassifier(n_neighbors=2, weights='distance')
knn.fit(X, Y)
I am not sure how to create the X and Y matrices, and then how to predict Y given the kNN estimation. Any help for a beginner like me would be much appreciated.
Upvotes: 2
Views: 1918
Reputation: 4243
Use a sparse array to store the data: parse the string values into the array, then fit the kNN model and predict with it.
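import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier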
line1 = '-0.9821 1:15 2:20 4:10 8:10'
line2 = '0.1235 1:15 2:20 6:10 10:10'
line3 = '0.2132 1:15 3:20 5:10 9:10'
line4 = '0.328 2:15 4:20 6:10 7:12 8:16 10:10'
line5 = '0.973 2:15 3:20 6:10 8:12 9:10'
data = [line1, line2, line3, line4, line5]
# Note: .toarray() converts the empty sparse matrix into a dense NumPy array of zeros.
sparseMatrix = csr_matrix((5, 15), dtype=np.float64).toarray()
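for row, item in enumerate(data):
    for entry in item.split(' '):
        if ':' in entry:
            index, value = entry.split(':')
            sparseMatrix[row, int(index)] = float(value)
        else:
            # The first token has no colon: it is the Y value, stored in column 0.
            sparseMatrix[row, 0] = float(entry)
# Columns 1..14 hold the features; column 0 holds Y.
X = sparseMatrix[:, 1:15]
# Scale Y by 10 and truncate to integers so the values can serve as class labels.
y = (sparseMatrix[:, 0] * 10).astype(int)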
knn = KNeighborsClassifier(algorithm='auto',
                           leaf_size=10,
                           metric='minkowski',
                           metric_params=None,
                           n_jobs=1,
                           n_neighbors=3,
                           p=2,
                           weights='uniform')
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.2, random_state=21)
knn.fit(X_train,y_train)
train_accuracy = knn.score(X_train, y_train)
test_accuracy=knn.score(X_test,y_test)
print(train_accuracy,test_accuracy)
for item in X:
    prediction = knn.predict([item])
    print(item, prediction)
y_pred=knn.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
output:
0.25 0.0
[15. 20. 0. 10. 0. 0. 0. 10. 0. 0. 0. 0. 0. 0.] [-9]
[15. 20. 0. 0. 0. 10. 0. 0. 0. 10. 0. 0. 0. 0.] [-9]
[15. 0. 20. 0. 10. 0. 0. 0. 10. 0. 0. 0. 0. 0.] [-9]
[ 0. 15. 0. 20. 0. 10. 12. 16. 0. 10. 0. 0. 0. 0.] [-9]
[ 0. 15. 20. 0. 0. 10. 0. 12. 10. 0. 0. 0. 0. 0.] [-9]
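The code in this answer ends up with a dense array (because of toarray()) and turns Y into integer class labels. Since Y in the question is continuous, a variant that keeps X sparse and uses a regressor may be closer to what was asked. The following is only a sketch under a few assumptions: parse_lines is a hypothetical helper, the feature indices are taken to be 1-based and to run from 1 to 10, and KNeighborsRegressor is swapped in for the classifier.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import KNeighborsRegressor

lines = [
    '-0.9821 1:15 2:20 4:10 8:10',
    '0.1235 1:15 2:20 6:10 10:10',
    '0.2132 1:15 3:20 5:10 9:10',
    '0.328 2:15 4:20 6:10 7:12 8:16 10:10',
    '0.973 2:15 3:20 6:10 8:12 9:10',
]

def parse_lines(lines, n_features):
    """Hypothetical helper: parse 'y idx:val ...' lines into (X, y)."""
    rows, cols, vals, targets = [], [], [], []
    for row, line in enumerate(lines):
        target, *entries = line.split()
        targets.append(float(target))
        for entry in entries:
            index, value = entry.split(':')
            rows.append(row)
            cols.append(int(index) - 1)  # assumption: indices in the text are 1-based
            vals.append(float(value))
    # Build the CSR matrix from (value, (row, col)) triplets; zeros are never stored.
    X = csr_matrix((vals, (rows, cols)), shape=(len(lines), n_features))
    return X, np.array(targets)

X, y = parse_lines(lines, n_features=10)  # assumption: 10 features in total

# A regressor, because Y is a real number rather than a class label.
knn = KNeighborsRegressor(n_neighbors=2, weights='distance')
knn.fit(X, y)
y_pred = knn.predict(X)
```

With only five rows there is not enough data for a meaningful train/test split, so this only shows the mechanics of building X and Y and calling fit and predict.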
Upvotes: 0
Reputation: 21914
The short version is that the format you're using is going to cause you a decent amount of grief. The long version is that it's still absolutely possible to do this conversion; there's just a decent amount of glue code that you're going to need. The first thing you need to do is split each string on the first space: the part before it is y, and the rest is grouped into x.
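from scipy import sparse

def convert_to_csc(x, shape):
    # Build in LIL format, which supports cheap item assignment, then convert to CSC.
    sparse_matrix = sparse.lil_matrix(shape)
    for entry in x.split():
        index, value = entry.split(":")
        sparse_matrix[0, int(index)] = float(value)
    return sparse_matrix.tocsc()

# Assumption: one row per line, wide enough for the largest index in the sample (10).
shape = (1, 11)
y, _, x = line1.partition(" ")
y = float(y)
x = convert_to_csc(x, shape)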
I'll leave the rest as an exercise to the reader, but the rest should be pretty trivial. If you have the chance later on I would suggest relying on a more robust format.
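As a side note, the lines in the question follow the same "label index:value ..." layout as the svmlight/libsvm text format, which scikit-learn can read with load_svmlight_file. The sketch below assumes the indices are 1-based and that there are 10 features in total; the raw variable is only there to simulate a file in memory.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# Simulate a file: one sample per line, as in the question.
raw = "\n".join([
    "-0.9821 1:15 2:20 4:10 8:10",
    "0.1235 1:15 2:20 6:10 10:10",
    "0.2132 1:15 3:20 5:10 9:10",
    "0.328 2:15 4:20 6:10 7:12 8:16 10:10",
    "0.973 2:15 3:20 6:10 8:12 9:10",
]).encode("ascii") + b"\n"

# zero_based=False treats the indices in the text as 1-based (assumption),
# so index 1 becomes column 0 of X. File-like objects must be in binary mode.
X, y = load_svmlight_file(BytesIO(raw), n_features=10, zero_based=False)

print(X.shape)  # X is a scipy.sparse CSR matrix
print(y)        # y is a NumPy array holding the first entry of each line
```

If the data already lives in a file, passing the file path instead of a BytesIO object should work the same way.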
To make it clear, aggregating the x's and y's in this example will give you the X and Y in your code above. As far as getting the prediction out afterward, sklearn estimators use the fit/predict paradigm, meaning first you fit, then you predict (transform and fit_transform belong to transformers, and KNeighborsClassifier does not provide them). After you call fit above, you can get a prediction like so:
prediction = knn.predict(example_x)
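Putting the aggregation and the prediction together might look roughly like the sketch below. It makes several assumptions: row_from_text is a hypothetical helper, the indices are 1-based with 10 features in total, example_x is a made-up query, and KNeighborsRegressor stands in for the classifier because the Y values in the question are continuous.

```python
from scipy import sparse
from sklearn.neighbors import KNeighborsRegressor

def row_from_text(x, n_features):
    """Hypothetical helper: turn 'idx:val ...' into a 1 x n_features CSR row."""
    row = sparse.lil_matrix((1, n_features))
    for entry in x.split():
        index, value = entry.split(":")
        row[0, int(index) - 1] = float(value)  # assumption: 1-based indices
    return row.tocsr()

lines = [
    "-0.9821 1:15 2:20 4:10 8:10",
    "0.1235 1:15 2:20 6:10 10:10",
    "0.2132 1:15 3:20 5:10 9:10",
    "0.328 2:15 4:20 6:10 7:12 8:16 10:10",
    "0.973 2:15 3:20 6:10 8:12 9:10",
]

# Aggregate the per-line y values and sparse x rows.
ys, xs = [], []
for line in lines:
    y_text, _, x_text = line.partition(" ")
    ys.append(float(y_text))
    xs.append(row_from_text(x_text, n_features=10))

X = sparse.vstack(xs).tocsr()  # stack the 1 x 10 rows into a 5 x 10 matrix
Y = ys

knn = KNeighborsRegressor(n_neighbors=2, weights="distance")
knn.fit(X, Y)

# A made-up query row in the same text format, for illustration only.
example_x = row_from_text("1:15 2:20 4:10 9:10", n_features=10)
prediction = knn.predict(example_x)
print(prediction)
```

The query has to be built with the same number of columns as X, otherwise predict will reject it.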
I still think you should look into using sklearn's SVR natively. I'd also highly suggest trying another model. Logistic Regression probably won't give you better performance than SVR in this case (though I could be wrong), but it would serve as an excellent testbed for any augmentations or general data tweaks you're thinking of adding, if for no reason other than the computational efficiency. SVR on the dataset you're talking about is... not going to run quickly.
Upvotes: 2