Rohan Saxena

Reputation: 3328

Numpy's standard deviation method giving divide by zero error

I have written a function to regularise a set of features in a machine learning algorithm. It takes a rectangular 2D numpy array features and returns its regularised version reg_features (I am using the Boston housing prices data from Scikit-learn for training). The exact code:

import tensorflow as tf
import numpy as np
from sklearn.datasets import load_boston
from pprint import pprint

def regularise(features):

    # Regularised features:
    reg_features = np.zeros(features.shape)

    for x in range(len(features)):
        for y in range(len(features[x])):

            # Standardise each entry using the mean and std of its column
            reg_features[x][y] = (features[x][y] - np.mean(features[:, y])) / np.std(features[:, y])

    return reg_features

# Get the data
total_features, total_prices = load_boston(True)

# Keep 300 samples for training
train_features = regularise(total_features[:300])        # Works OK
train_prices = total_prices[:300]

# Keep 100 samples for validation
valid_features = regularise(total_features[300:400])     # Works OK
valid_prices = total_prices[300:400]

# Keep remaining samples as test set
test_features = regularise(total_features[400:])         # Does not work
test_prices = total_prices[400:]

Note that I only get this error for the last call to regularise(), the one with total_features[400:]:

/Users/RohanSaxena/Documents/projects/sdc/tensor/reg.py:11: RuntimeWarning: invalid value encountered in double_scalars reg_features[x][y] = (features[x][y] - np.mean(features[:, y])) / np.std(features[:, y])

The code below refers to that last call, regularise(total_features[400:]).

To check whether one of the standard deviations is zero, I do this:

for y in range(len(features[0])):
    if np.std(features[:, y]) == 0.:
        print(np.std(features[:, y]))

This prints all zeros, that is:

0.0
0.0
...
0.0

a total of features[0].size times. This means that the standard deviation of every column in features is zero.

Now this seems very odd. So I print every single standard deviation to be sure:

for y in range(len(features[0])):
    print(np.std(features[:, y]))

I get all non-zero values:

10.9976293017
23.3483275632
6.63216140033
....
8.00329244499

How is this possible? Just before, guarded by the if condition, the very same expression printed all zeros, and now it prints non-zero values! This doesn't make sense to me. Any help is appreciated.

Upvotes: 0

Views: 2394

Answers (2)

Warren Weckesser

Reputation: 114831

It is the subset of the data total_features[400:] that results in the problem. If you look at that data, you'll see that the columns total_features[400:, 1] and total_features[400:, 3] are all 0. This causes a problem in your code, because both the mean and the standard deviation of those columns are 0, resulting in 0/0.
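You can check this directly on the same data; for example (reusing the load_boston call from the question):

import numpy as np
from sklearn.datasets import load_boston

total_features, total_prices = load_boston(True)

# A constant column has zero standard deviation; per the observation above,
# this should print the column indices 1 and 3 for the test slice.
print(np.where(np.std(total_features[400:], axis=0) == 0)[0])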

Instead of writing your own regularize function, you could use sklearn.preprocessing.scale. That function handles a constant column by returning a column that is all 0.

You can easily verify that scale does the same calculation as your regularise:

In [68]: test
Out[68]: 
array([[ 15.,   1.,   0.],
       [  3.,   4.,   5.],
       [  6.,   7.,   8.],
       [  9.,  10.,  11.],
       [ 12.,  13.,   1.]])

In [69]: regularise(test)
Out[69]: 
array([[ 1.41421356, -1.41421356, -1.20560706],
       [-1.41421356, -0.70710678,  0.        ],
       [-0.70710678,  0.        ,  0.72336423],
       [ 0.        ,  0.70710678,  1.44672847],
       [ 0.70710678,  1.41421356, -0.96448564]])

In [70]: from sklearn.preprocessing import scale

In [71]: scale(test)
Out[71]: 
array([[ 1.41421356, -1.41421356, -1.20560706],
       [-1.41421356, -0.70710678,  0.        ],
       [-0.70710678,  0.        ,  0.72336423],
       [ 0.        ,  0.70710678,  1.44672847],
       [ 0.70710678,  1.41421356, -0.96448564]])

The following shows how the functions handle a column of zeros:

In [72]: test[:,2] = 0

In [73]: test
Out[73]: 
array([[ 15.,   1.,   0.],
       [  3.,   4.,   0.],
       [  6.,   7.,   0.],
       [  9.,  10.,   0.],
       [ 12.,  13.,   0.]])

In [74]: regularise(test)
/Users/warren/miniconda3/bin/ipython:9: RuntimeWarning: invalid value encountered in double_scalars
Out[74]: 
array([[ 1.41421356, -1.41421356,         nan],
       [-1.41421356, -0.70710678,         nan],
       [-0.70710678,  0.        ,         nan],
       [ 0.        ,  0.70710678,         nan],
       [ 0.70710678,  1.41421356,         nan]])

In [75]: scale(test)
Out[75]: 
array([[ 1.41421356, -1.41421356,  0.        ],
       [-1.41421356, -0.70710678,  0.        ],
       [-0.70710678,  0.        ,  0.        ],
       [ 0.        ,  0.70710678,  0.        ],
       [ 0.70710678,  1.41421356,  0.        ]])
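If you would rather keep a hand-written function, a minimal sketch (not part of the original answer) that computes the column statistics once and leaves constant columns at zero, mimicking scale, could look like this:

import numpy as np

def regularise_safe(features):
    # Column-wise mean and standard deviation, computed once per column.
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # Replace zero standard deviations by 1 so constant columns come out as 0
    # (the behaviour of sklearn.preprocessing.scale) instead of nan.
    safe_std = np.where(std == 0, 1.0, std)
    return (features - mean) / safe_std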

Upvotes: 1

epattaro

Reputation: 2428

Usually when that happens, the first guess would be that you are dividing the numerator by an int (rather than a float) larger than it, so the result is 0. However, that does not seem to be the case here.

Sometimes the division is not doing what you expect it to do (term by term); rather, it is a vector operation. However, that is also not the case here.
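For illustration only (neither of these is what is happening in the question):

import numpy as np

# Floor division between positive ints discards the fractional part, so the result is 0.
print(3 // 5)          # 0 (on Python 2, 3 / 5 behaves the same way)

# Division between NumPy arrays is a vectorised, element-wise operation.
print(np.array([1.0, 2.0, 3.0]) / np.array([2.0, 4.0, 6.0]))   # [0.5 0.5 0.5]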

The issue here is how you reference your DataFrame:

reg_features[x][y]

When dealing with a DataFrame and assigning values to specific cells, you want to use .loc.

You can read more about it here http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html
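A minimal sketch of that pattern, assuming the features were held in a pandas DataFrame (the column names here are made up):

import numpy as np
import pandas as pd

features = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])
reg_features = pd.DataFrame(np.zeros(features.shape), columns=features.columns)

for col in features.columns:
    # .loc writes into the target DataFrame directly instead of a possible copy
    reg_features.loc[:, col] = (features[col] - features[col].mean()) / features[col].std()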

Upvotes: 0
