Sunny Duckling

Reputation: 357

Normalization: how to avoid zero standard deviation

I have the following task:

Normalize the matrix by columns. From each value in a column, subtract the column average and divide by the column's standard deviation. Your output should not contain NaN (caused by division by zero). Replace NaNs with 1. Don't use if or while/for.

I am working with NumPy, so I wrote the following code:

import numpy as np

def normalize(matrix: np.ndarray) -> np.ndarray:
    # Subtract the column mean, then divide by the column standard deviation
    res = (matrix - np.mean(matrix, axis=0)) / np.std(matrix, axis=0, dtype=np.float64)
    return res

matrix = np.array([[1, 4, 4200], [0, 10, 5000], [1, 2, 1000]])
assert np.allclose(
    normalize(matrix),
    np.array([[ 0.7071, -0.39223,  0.46291],
              [-1.4142,  1.37281,  0.92582],
              [ 0.7071, -0.98058, -1.38873]])
)

The assertion passes, so the result is correct.

However, my question is: how do I avoid division by zero? If I have a column of identical numbers, its standard deviation is 0 and I get NaN values in the result. How do I solve this? I would be grateful for any help!
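
To make the problem concrete, here is a minimal sketch that reproduces it. The array name `constant` and its values are made up for illustration, and `np.errstate` is used only to silence the warning NumPy would otherwise emit for the 0 / 0 division:

```python
import numpy as np

# Hypothetical input: the first column is constant, so its standard deviation is 0
constant = np.array([[2, 4], [2, 10], [2, 2]])

# Silence the "invalid value" warning that the 0 / 0 division triggers
with np.errstate(divide="ignore", invalid="ignore"):
    res = (constant - constant.mean(axis=0)) / constant.std(axis=0)

# The constant column becomes NaN; the other column is normalized as usual
print(np.isnan(res[:, 0]).all())
print(np.isnan(res[:, 1]).any())
```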

Upvotes: 1

Views: 1587

Answers (1)

FlyingTeller

Reputation: 20492

Your task only requires that the output contain no NaN and that any NaN that does occur be replaced with 1. It does not say that intermediate results may not contain NaN. One valid solution is therefore to call numpy.nan_to_num on res before returning (the 0 / 0 division still emits a RuntimeWarning, which is harmless here):

import numpy as np

def normalize(matrix: np.ndarray) -> np.ndarray:
    res = (matrix - np.mean(matrix, axis=0)) / np.std(matrix, axis=0, dtype=np.float64)
    # Replace NaN (from 0 / 0 in constant columns) with 1.0, in place (copy=False)
    return np.nan_to_num(res, copy=False, nan=1.0)

matrix = np.array([[2, 4, 4200], [2, 10, 5000], [2, 2, 1000]])
print(normalize(matrix))

yields:

[[ 1.         -0.39223227  0.46291005]
 [ 1.          1.37281295  0.9258201 ]
 [ 1.         -0.98058068 -1.38873015]]
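
If you would rather not produce NaN at all (and avoid the RuntimeWarning), one possible alternative is to skip the division for zero-variance columns using the `where` and `out` arguments of `np.divide`. This is only a sketch under the same task constraints (no if, no loops); the function name `normalize_safe` is made up for illustration:

```python
import numpy as np

def normalize_safe(matrix: np.ndarray) -> np.ndarray:
    mean = np.mean(matrix, axis=0)
    std = np.std(matrix, axis=0, dtype=np.float64)
    # Pre-fill the result with 1.0; masked-out positions keep this value
    out = np.ones(matrix.shape, dtype=np.float64)
    # Divide only in columns where std is non-zero (the mask broadcasts over rows)
    np.divide(matrix - mean, std, out=out, where=std != 0)
    return out

matrix = np.array([[2, 4, 4200], [2, 10, 5000], [2, 2, 1000]])
print(normalize_safe(matrix))
```

For this matrix it should give the same values as the output above. The difference is that the constant column is never divided, so there is nothing to clean up afterwards.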

Upvotes: 1
