Rohan Saxena

Reputation: 3328

Numpy's standard deviation method giving divide by zero error

I have written a function to regularise a set of features in a machine learning algorithm. It takes a rectangular 2D numpy array features and returns its regularised version reg_features (I am using the Boston housing prices data from Scikit-learn for training). The exact code:

import tensorflow as tf
import numpy as np
from sklearn.datasets import load_boston
from pprint import pprint

def regularise(features):

    # Regularised features:
    reg_features = np.zeros(features.shape)

    for x in range(len(features)):
        for y in range(len(features[x])):

            # Standardise each entry using the mean and std of its column
            reg_features[x][y] = (features[x][y] - np.mean(features[:, y])) / np.std(features[:, y])

    return reg_features

# Get the data
total_features, total_prices = load_boston(True)

# Keep 300 samples for training
train_features = regularise(total_features[:300])        # Works OK
train_prices = total_prices[:300]

# Keep 100 samples for validation
valid_features = regularise(total_features[300:400])     # Works OK
valid_prices = total_prices[300:400]

# Keep remaining samples as test set
test_features = regularise(total_features[400:])         # Does not work
test_prices = total_prices[400:]

Note that I only get this error for the last call to regularise(), the one with total_features[400:]:

/Users/RohanSaxena/Documents/projects/sdc/tensor/reg.py:11: RuntimeWarning: invalid value encountered in double_scalars reg_features[x][y] = (features[x][y] - np.mean(features[:, y])) / np.std(features[:, y])

The code below refers to that last call, regularise(total_features[400:]).

To check whether one of the standard deviations is zero, I do this:

for y in range(len(features[0])):
    if np.std(features[:, y]) == 0.:
        print(np.std(features[:, y]))

This prints all zeros, that is:

0.0
0.0
...
0.0

a total of features[0].size times. This means that the standard deviation of every column in features is zero.

Now this seems very odd. So I print every single standard deviation to be sure:

for y in range(len(features[0])):
    print(np.std(features[:, y]))

I get all non-zero values:

10.9976293017
23.3483275632
6.63216140033
....
8.00329244499

How is this possible? Just before, guarded by the if condition, the very same expression printed all zeros, and now it prints non-zero values! This doesn't make sense to me. Any help is appreciated.

Upvotes: 0

Views: 2394

Answers (2)

Warren Weckesser

Reputation: 114831

It is the subset of the data total_features[400:] that results in the problem. If you look at that data, you'll see that the columns total_features[400:, 1] and total_features[400:, 3] are all 0. This causes a problem in your code, because both the mean and the standard deviation of those columns are 0, resulting in 0/0.
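You can check this directly on the same data; for example (reusing the load_boston call from the question):

import numpy as np
from sklearn.datasets import load_boston

total_features, total_prices = load_boston(True)

# A constant column has zero standard deviation; per the observation above,
# this should print the column indices 1 and 3 for the test slice.
print(np.where(np.std(total_features[400:], axis=0) == 0)[0])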

Instead of writing your own regularize function, you could use sklearn.preprocessing.scale. That function handles a constant column by returning a column that is all 0.

You can easily verify that scale does the same calculation as your regularise:

In [68]: test
Out[68]: 
array([[ 15.,   1.,   0.],
       [  3.,   4.,   5.],
       [  6.,   7.,   8.],
       [  9.,  10.,  11.],
       [ 12.,  13.,   1.]])

In [69]: regularise(test)
Out[69]: 
array([[ 1.41421356, -1.41421356, -1.20560706],
       [-1.41421356, -0.70710678,  0.        ],
       [-0.70710678,  0.        ,  0.72336423],
       [ 0.        ,  0.70710678,  1.44672847],
       [ 0.70710678,  1.41421356, -0.96448564]])

In [70]: from sklearn.preprocessing import scale

In [71]: scale(test)
Out[71]: 
array([[ 1.41421356, -1.41421356, -1.20560706],
       [-1.41421356, -0.70710678,  0.        ],
       [-0.70710678,  0.        ,  0.72336423],
       [ 0.        ,  0.70710678,  1.44672847],
       [ 0.70710678,  1.41421356, -0.96448564]])

The following shows how the functions handle a column of zeros:

In [72]: test[:,2] = 0

In [73]: test
Out[73]: 
array([[ 15.,   1.,   0.],
       [  3.,   4.,   0.],
       [  6.,   7.,   0.],
       [  9.,  10.,   0.],
       [ 12.,  13.,   0.]])

In [74]: regularise(test)
/Users/warren/miniconda3/bin/ipython:9: RuntimeWarning: invalid value encountered in double_scalars
Out[74]: 
array([[ 1.41421356, -1.41421356,         nan],
       [-1.41421356, -0.70710678,         nan],
       [-0.70710678,  0.        ,         nan],
       [ 0.        ,  0.70710678,         nan],
       [ 0.70710678,  1.41421356,         nan]])

In [75]: scale(test)
Out[75]: 
array([[ 1.41421356, -1.41421356,  0.        ],
       [-1.41421356, -0.70710678,  0.        ],
       [-0.70710678,  0.        ,  0.        ],
       [ 0.        ,  0.70710678,  0.        ],
       [ 0.70710678,  1.41421356,  0.        ]])
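If you would rather keep a hand-written function, a minimal sketch (not part of the original answer) that computes the column statistics once and leaves constant columns at zero, mimicking scale, could look like this:

import numpy as np

def regularise_safe(features):
    # Column-wise mean and standard deviation, computed once per column.
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # Replace zero standard deviations by 1 so constant columns come out as 0
    # (the behaviour of sklearn.preprocessing.scale) instead of nan.
    safe_std = np.where(std == 0, 1.0, std)
    return (features - mean) / safe_std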

Upvotes: 1

epattaro

Reputation: 2428

Usually when that happens, the first guess would be that you are dividing the numerator by an int (rather than a float) larger than it, so the result is 0. However, that does not seem to be the case here.

Sometimes the division is not doing what you expect it to do (term by term); rather, it is a vector operation. However, that is also not the case here.
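For illustration only (neither of these is what is happening in the question):

import numpy as np

# Floor division between positive ints discards the fractional part, so the result is 0.
print(3 // 5)          # 0 (on Python 2, 3 / 5 behaves the same way)

# Division between NumPy arrays is a vectorised, element-wise operation.
print(np.array([1.0, 2.0, 3.0]) / np.array([2.0, 4.0, 6.0]))   # [0.5 0.5 0.5]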

The issue here is how you reference your DataFrame:

reg_features[x][y]

When dealing with a DataFrame and assigning values to specific cells, you want to use .loc.

You can read more about it here http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html
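A minimal sketch of that pattern, assuming the features were held in a pandas DataFrame (the column names here are made up):

import numpy as np
import pandas as pd

features = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])
reg_features = pd.DataFrame(np.zeros(features.shape), columns=features.columns)

for col in features.columns:
    # .loc writes into the target DataFrame directly instead of a possible copy
    reg_features.loc[:, col] = (features[col] - features[col].mean()) / features[col].std()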

Upvotes: 0
