satoru

Reputation: 33225

How can I combine one column and a matrix into a larger matrix with `numpy`?

I'm trying to normalize a matrix by doing (X - means) / variance to each row.

Since I am implementing this with MapReduce, I first calculate the mean and variance of each column, and then map each row with:

   matrix.map(lambda X: (X - means) / variance)

But I want to ignore the first element in each row X, which is my target column containing only 1s and 0s.

How can I do this?

Upvotes: 1

Views: 196

Answers (1)

Jaime

Reputation: 67447

If A is a numpy array of shape (m, n + 1) and you also have arrays mu and s2 of shape (n,) holding the mean and variance of each column except the first one, you can do your normalization as follows:

A[:, 1:] = (A[:, 1:] - mu) / s2

To understand what is going on, you need to understand how broadcasting works. Since A[:, 1:] has shape (m, n) and mu and s2 have shape (n,), the latter two have 1s prepended to their shapes to match the number of dimensions of the first, so they are treated as (1, n) arrays, and during the arithmetic operations the values in their single row are broadcast to all m rows.
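As a rough illustration of that broadcasting rule, the sketch below compares an explicit reshape to (1, n) with the implicit version. The array values are made up for the example, and s2 here is just an arbitrary divisor rather than a computed variance.

```python
import numpy as np

# Hypothetical data: m = 3 rows; column 0 is the 0/1 target, n = 2 features.
A = np.array([[1.0, 2.0, 10.0],
              [0.0, 4.0, 20.0],
              [1.0, 6.0, 30.0]])
mu = np.array([4.0, 20.0])   # shape (n,)
s2 = np.array([2.0, 10.0])   # shape (n,), arbitrary divisors for illustration

features = A[:, 1:]          # shape (m, n)

# Explicitly reshaping to (1, n) gives the same result as plain broadcasting.
explicit = (features - mu.reshape(1, -1)) / s2.reshape(1, -1)
implicit = (features - mu) / s2
same = np.array_equal(explicit, implicit)

# Write the result back; the target column A[:, 0] is left untouched.
A[:, 1:] = implicit
```

Note that the in-place assignment assumes A has a floating-point dtype; with an integer array the fractional parts would be truncated on assignment.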

If you are not already doing so, your mean and variance arrays can be calculated efficiently as

mu = A[:, 1:].mean(axis=0)
s2 = A[:, 1:].var(axis=0)

For the variance, note that np.var accepts a ddof argument (as does np.std), so you can pass ddof=1 to get the sample variance rather than the population variance; see the docs.
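For what ddof changes in practice, here is a small sketch on a made-up single-column array. It only assumes the standard ddof keyword of ndarray.var and ndarray.std.

```python
import numpy as np

# Made-up feature column with 4 observations.
x = np.array([[1.0], [2.0], [3.0], [4.0]])

pop_var = x.var(axis=0)                   # ddof=0 (default): divides by N
sample_var = x.var(axis=0, ddof=1)        # ddof=1: divides by N - 1
std_squared = x.std(axis=0, ddof=1) ** 2  # same value via the standard deviation
```

Here pop_var is 1.25 and sample_var is 5/3; squaring the ddof=1 standard deviation gives the same value as the ddof=1 variance, up to floating-point rounding.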

On a separate note, normalization is normally done dividing by the standard deviation, not the variance.
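A minimal sketch of that variant, dividing by the standard deviation instead of the variance, might look like the following. The data are invented, and the code assumes a float array with no constant feature columns (a zero standard deviation would cause a division by zero).

```python
import numpy as np

# Hypothetical data: column 0 is the 0/1 target, the rest are features.
A = np.array([[1.0, 2.0, 10.0],
              [0.0, 4.0, 20.0],
              [1.0, 6.0, 30.0]])

mu = A[:, 1:].mean(axis=0)    # per-column mean, shape (n,)
sigma = A[:, 1:].std(axis=0)  # per-column standard deviation, shape (n,)

# Standardize every column except the target.
A[:, 1:] = (A[:, 1:] - mu) / sigma
```

After this step each feature column has mean 0 and standard deviation 1, which is not the case when dividing by the variance.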

Upvotes: 2
