Reputation: 177
I'm currently going through Andrej Karpathy's Hacker's guide to Neural Networks. In Chapter 2: Machine Learning, Binary Classification, he gives an example of a (very basic) SVM. Here's Karpathy's code:
var a = 1, b = -2, c = -1; // initial parameters
for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = data[i][0];
  var y = data[i][1];
  var label = labels[i];
  // compute pull
  var score = a*x + b*y + c;
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1;
  if(label === -1 && score > -1) pull = -1;
  // compute gradient and update parameters
  var step_size = 0.01;
  a += step_size * (x * pull - a); // -a is from the regularization
  b += step_size * (y * pull - b); // -b is from the regularization
  c += step_size * (1 * pull);
}
And the following is my version, in Python:
import numpy
import random

X = numpy.array([[1.2, 0.7],
                 [-0.3, 0.5],
                 [-3, -1],
                 [0.1, 1.0],
                 [3.0, 1.1],
                 [2.1, -3]])
labels = [1, -1, 1, -1, -1, 1]

# initial parameters
a = 1
b = -2
c = -1

l = len(X) - 1
steps = 400

for n in range(0, steps):
    # pick a random data point
    i = random.randint(0, l)
    x = X[i][0]
    y = X[i][1]
    label = labels[i]

    # print the scores before any training
    if n == 0:
        for j in range(0, l + 1):
            x = X[j][0]
            y = X[j][1]
            label = labels[j]
            score = a*x + b*y + c
            print x, ",", y, "-->", label, "vs.", score

    # compute pull
    score = a*x + b*y + c
    pull = 0.0
    if label == 1 and score < 1:
        pull = 1
    if label == -1 and score > -1:
        pull = -1

    # compute gradient and update parameters
    step_size = 0.01
    a += step_size * (x * pull - a)  # -a is from the regularization
    b += step_size * (y * pull - b)  # -b is from the regularization
    c += step_size * (1 * pull)

    # print the scores after the last iteration
    if n == steps - 1:
        print ""
        for j in range(0, l + 1):
            x = X[j][0]
            y = X[j][1]
            label = labels[j]
            score = a*x + b*y + c
            print x, ",", y, "-->", label, "vs.", score
The problem is that even after more than the suggested 400 iterations, the parameters still don't yield the correct label for some of the vectors.
Here's the output after 400 iterations:
1.2 , 0.7 --> 1 vs. -0.939483353298
-0.3 , 0.5 --> -1 vs. -0.589208602761
-3.0 , -1.0 --> 1 vs. 0.651953448705
0.1 , 1.0 --> -1 vs. -0.921882586141
3.0 , 1.1 --> -1 vs. -1.44552077331
2.1 , -3.0 --> 1 vs. 0.896623596303
The first value after the "-->" is the correct label; the second value is the score, i.e. the learned label.
All vector/learned labels are correct (in the sense of being assigned a value with the correct sign), except for the first one.
I'm not sure what the reason is for this: Did I make a mistake in my code? I checked it a few times, but didn't find anything. Or am I forgetting something Python-specific here? Or, finally, is there some ML-related reason why the correct label isn't learned in this case? I doubt it, though; otherwise it wouldn't make sense that Karpathy got the right results.
Any comments or help in figuring it out much appreciated.
Upvotes: 2
Views: 85
Reputation: 77880
I believe that I found the problem(s):
(A) Your data set has no linear cut.
(B) Karpathy's "Monte Carlo" gradient descent thrashes on such a data set.
(C) You and Karpathy used different data.
DATA SETS

label   Karpathy's       yours
  1     [1.2, 0.7]       [1.2, 0.7]
 -1     [-0.3, -0.5]     [-0.3, 0.5]
  1     [3.0, 0.1]       [-3, -1]
 -1     [-0.1, -1.0]     [0.1, 1.0]
 -1     [-1.0, 1.1]      [3.0, 1.1]
  1     [2.1, -3]        [2.1, -3]
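If you want to confirm that the data is what makes the difference, rerun your exact loop with Karpathy's points (the left-hand column) substituted in. A minimal sketch; the initial parameters and update rule are copied from your script, only the data changes:

import random

karpathy_X = [[1.2, 0.7], [-0.3, -0.5], [3.0, 0.1],
              [-0.1, -1.0], [-1.0, 1.1], [2.1, -3]]
karpathy_labels = [1, -1, 1, -1, -1, 1]

a, b, c = 1, -2, -1          # same initial parameters as before
step_size = 0.01

for n in range(400):
    # pick a random data point
    i = random.randint(0, len(karpathy_X) - 1)
    x, y = karpathy_X[i]
    label = karpathy_labels[i]

    # compute pull
    score = a * x + b * y + c
    pull = 0.0
    if label == 1 and score < 1:
        pull = 1
    if label == -1 and score > -1:
        pull = -1

    # update parameters (the -a and -b terms are the regularization)
    a += step_size * (x * pull - a)
    b += step_size * (y * pull - b)
    c += step_size * (1 * pull)

for (x, y), label in zip(karpathy_X, karpathy_labels):
    print("%s , %s --> %2d vs. %.4f" % (x, y, label, a * x + b * y + c))

On that data the scores generally settle onto the correct signs, which is the result Karpathy reports; with your data they don't, for the reasons discussed below.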
The data set you gave almost has a cut line (hyperplane) at roughly y = x/3 + 1/2, but the three points closest to the line argue constantly about the division. As it turns out, the best divider is distinctly different, leaving [1.2, 0.7] seriously on the wrong side of the line, but quite unhappy about that.
The original data has a neat cut line at roughly y = -3x + 1, which this algorithm roughly approximates with the (rounded) fit 0.6x - 0.1y - 0.5.
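You can sanity-check that rounded fit directly. A quick sketch (0.6, -0.1 and -0.5 are just the rounded parameters quoted above, applied to Karpathy's six points):

# Does 0.6*x - 0.1*y - 0.5 put each of Karpathy's points on the correct side?
a, b, c = 0.6, -0.1, -0.5
points = [([1.2, 0.7], 1), ([-0.3, -0.5], -1), ([3.0, 0.1], 1),
          ([-0.1, -1.0], -1), ([-1.0, 1.1], -1), ([2.1, -3], 1)]

for (x, y), label in points:
    score = a * x + b * y + c
    verdict = "ok" if (score > 0) == (label == 1) else "WRONG"
    print("%5s , %5s --> %2d vs. %+.2f  %s" % (x, y, label, score, verdict))

All six signs come out right, even though this is a cheap fit rather than a maximum-margin separator.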
Again, this algorithm is looking for a minimal-cost fit rather than the "pure" SVM of the widest separating channel. Even when there is a neat cut line, this algorithm does not converge on it; rather, it hacks its way to the general vicinity of an acceptable solution.
It chooses a random point. If the point is solidly classified -- the score is of the right sign with a magnitude > 1 -- nothing happens. However, if the point is on the wrong side of the line, or even too close for comfort, then it pulls the parameters for more favourable treatment.
Unless the channel is 2 units wide, the points within the disputed territories will continue to take turns shoving it back and forth. There is no convergence criterion ... and, indeed, no convergence guarantee after a certain point.
Look carefully at Karpathy's code: the main algorithm makes changes for data points with scores < 1 or > -1 (depending on training class). However, the evaluation algorithm claims victory if the sign of the result is correct. This is reasonable, but it isn't entirely consistent with the training function. In my trials, that first point always has a score with magnitude < 0.07, but the actual value waffles on either side of 0. The other points are well clear of 0, but only two of them pass 1. There are four points arguing over where the line should be.
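If you want to watch that waffling yourself, here is a rough sketch: rerun your loop on your data for more iterations and record the score of [1.2, 0.7] after every update (the 4000-iteration count and the variable names are my own choices, not anything from the book):

import random

X = [[1.2, 0.7], [-0.3, 0.5], [-3, -1], [0.1, 1.0], [3.0, 1.1], [2.1, -3]]
labels = [1, -1, 1, -1, -1, 1]
a, b, c = 1, -2, -1
step_size = 0.01
first_point_scores = []

for n in range(4000):
    i = random.randint(0, len(X) - 1)
    x, y = X[i]
    label = labels[i]
    score = a * x + b * y + c
    pull = 0.0
    if label == 1 and score < 1:
        pull = 1
    if label == -1 and score > -1:
        pull = -1
    a += step_size * (x * pull - a)
    b += step_size * (y * pull - b)
    c += step_size * (1 * pull)
    # score of the contested point [1.2, 0.7] after this update
    first_point_scores.append(a * 1.2 + b * 0.7 + c)

# look at the tail of the run
print(" ".join("%+.3f" % s for s in first_point_scores[-20:]))

If what you see matches my trials, those values stay small in magnitude but flip sign from run to run, while the scores of the other points sit well clear of 0.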
Does this clear things up for you?
Upvotes: 2