Reputation: 33225
I'm trying to normalize a matrix by doing (X - means) / variance
to each row.
Since I am implementing this with MapReduce
, I first calculate the means and standard variance for each column, and then map
each row with:
matrix.map(lambda X: (X - means) / variance)
But I want to ignore the first element in each row X
, which is my target column containing only 1s and 0s.
How can I do this?
Upvotes: 1
Views: 196
Reputation: 67447
If A
is a numpy array of shape (m, n + 1)
and you also have arrays mu
and s2
of shape (n,)
holding the mean and variance of each column except the first one, you can do your normalization as follows:
A[:, 1:] = (A[:, 1:] - mu) / s2
To undestand wat goes on, you need to understand how broadcasting works. Since A[:, 1:]
has shape (m, n)
and mu
and s2
shape (n,)
, these last two have 1s prepended to their shape to match the dimensions of the first, so they are treated as (1, n)
arrays, and during the arithmetic operations the value in their first and only row is broadcasted to all rows.
If you are not already doing so, your meand and variance arrays can be calculated efficiently as
mu = (A[:, 1:].mean(axis=0)
s2 = A[:, 1:].var(axis=0)
For the variance you may want to use np.std
squared to take advantage of the ddof
argument, see the docs.
On a separate note, normalization is normally done dividing by the standard deviation, not the variance.
Upvotes: 2