Reputation: 1045

python numpy weighted average with nans

First things first: this is not a duplicate of "NumPy: calculate averages with NaNs removed". I'll explain why:

Suppose I have an array

from numpy import array, average, nan

a = array([1,2,3,4])

and I want to average over it with the weights

weights = [4,3,2,1]
output = average(a, weights=weights)
print(output)
     2.0

OK, so this is pretty straightforward. But now I have something like this:

a = array([1,2,nan,4])

Calculating the average with the usual method of course yields nan. Can I avoid this? In principle I want to ignore the nans, so I'd like to have something like this:

a = array([1,2,4])
weights = [4,3,1]
output = average(a, weights=weights)
print(output)
     1.75

Upvotes: 16

Answers (6)

Ashwini Chaudhary

Reputation: 251051

First find out indices where the items are not nan, and then pass the filtered versions of a and weights to numpy.average:

>>> import numpy as np
>>> a = np.array([1, 2, np.nan,4])
>>> weights = np.array([4, 3, 2, 1])
>>> indices = np.where(np.logical_not(np.isnan(a)))[0]
>>> np.average(a[indices], weights=weights[indices])
1.75

As suggested by @mtrw in the comments, it is cleaner to use a boolean mask here instead of an index array:

>>> indices = ~np.isnan(a)
>>> np.average(a[indices], weights=weights[indices])
1.75
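
If you need this in more than one place, the boolean-mask approach can be wrapped in a small helper. This is only a sketch: the name `nan_weighted_average` is made up here, and also dropping entries whose weight is nan is an extra assumption that goes beyond the question.

```python
import numpy as np

def nan_weighted_average(a, weights):
    """Weighted average of a 1-D array, skipping nan entries (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # keep only positions where both the value and its weight are valid
    # (dropping nan weights too is an assumption, not part of the question)
    keep = ~np.isnan(a) & ~np.isnan(weights)
    if not keep.any():
        return np.nan  # nothing left to average
    return np.average(a[keep], weights=weights[keep])

result = nan_weighted_average([1, 2, np.nan, 4], [4, 3, 2, 1])
print(result)  # 1.75
```

Note that np.average still raises ZeroDivisionError if the remaining weights sum to zero, so you may want to guard against that separately.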

Upvotes: 14

SiP

Reputation: 1160

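Since you're looking for the mean, another idea is to simply replace all the nan values with 0's, in both the values and the weights. Every nan value must end up with a weight of 0, otherwise it is counted as a real 0: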

>>> import numpy as np
>>> a = np.array([[ 3.,  2.,  5.], [np.nan,  4., np.nan], [np.nan, np.nan, np.nan]])
>>> w = np.array([[ 1.,  2.,  3.], [np.nan, np.nan, np.nan], [np.nan, np.nan, np.nan]])
>>> a[np.isnan(a)] = 0
>>> w[np.isnan(w)] = 0
>>> np.average(a, weights=w)
3.6666666666666665

This can be used with the axis argument of np.average, but be careful that your weights don't sum to 0 along that axis: np.average raises ZeroDivisionError in that case.
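
To see why the weights matter here, compare zeroing only the values with zeroing the values and their weights. This is just an illustration using the arrays from the question; the variable names are made up for the example.

```python
import numpy as np

a = np.array([1.0, 2.0, np.nan, 4.0])
w = np.array([4.0, 3.0, 2.0, 1.0])
nan_mask = np.isnan(a)

# zero the values only: the nan is now counted as a real 0 with weight 2
values_only = np.average(np.where(nan_mask, 0.0, a), weights=w)

# zero the values and the matching weights: the nan entry drops out
values_and_weights = np.average(np.where(nan_mask, 0.0, a),
                                weights=np.where(nan_mask, 0.0, w))

print(values_only)         # 1.4
print(values_and_weights)  # 1.75
```

Only the second result matches what you get by removing the nan entry and its weight by hand.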

Upvotes: 0

mmdfl

Reputation: 81

All the solutions above are very good, but they don't handle the case where there are nans in the weights. To handle that, using pandas:

import numpy as np

def weighted_average_ignoring_nan(df, col_value, col_weight):
  den = 0
  num = 0
  for index, row in df.iterrows():
    # skip rows where either the value or the weight is nan
    if not np.isnan(row[col_weight]) and not np.isnan(row[col_value]):
      den = den + row[col_weight]
      num = num + row[col_weight]*row[col_value]
  return num/den
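
The same calculation can be done without looping over rows. A rough vectorized sketch, assuming both columns are numeric; the function name `weighted_average_ignoring_nan_vec` and the sample column names are invented for this example:

```python
import numpy as np
import pandas as pd

def weighted_average_ignoring_nan_vec(df, col_value, col_weight):
    # keep rows where both the value and the weight are present
    valid = df[col_value].notna() & df[col_weight].notna()
    values = df.loc[valid, col_value]
    weights = df.loc[valid, col_weight]
    # same num/den as the loop version, computed in one go
    return (values * weights).sum() / weights.sum()

df = pd.DataFrame({"value": [1.0, 2.0, np.nan, 4.0],
                   "weight": [4.0, np.nan, 2.0, 1.0]})
print(weighted_average_ignoring_nan_vec(df, "value", "weight"))  # 1.6
```

Only the first and last rows have both a value and a weight, so the result is (1*4 + 4*1) / (4 + 1).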

Upvotes: 0

ZaxR

Reputation: 5165

Expanding on @Ashwini and @Nicolas' answers, here is a version that can also handle an edge case where all the data values are np.nan, and that is designed to also work with pandas DataFrame without type-related issues:

import numpy as np
import pandas as pd
from typing import List, Union

def calc_wa_ignore_nan(df: pd.DataFrame, measures: List[str],
                       weights: List[Union[float, int]]) -> np.ndarray:
    """ Calculates the weighted average of `measures`' values, ex-nans.

    When nans are present in  `measures`' values,
    the weights are recalculated based only on the weights for non-nan measures.

    Note:
        The calculation used is NOT the same as just ignoring nans.
        For example, if we had data and weights:
            data = [2, 3, np.nan]
            weights = [0.5, 0.2, 0.3]
            calc_wa_ignore_nan approach:
                (2*(0.5/(0.5+0.2))) + (3*(0.2/(0.5+0.2))) == 2.285714285714286
            The ignoring nans approach:
                (2*0.5) + (3*0.2) == 1.6

    Args:
        df: Multiple rows of numeric data values with `measures` as column headers.
        measures: The str names of the columns to select from `df`.
        weights: The numeric weights associated with `measures`.

    Example:
        >>> df = pd.DataFrame({"meas1": [1, 1],
                               "meas2": [2, 2],
                               "meas3": [3, 3],
                               "meas4": [np.nan, 0],
                               "meas5": [5, 5]})
        >>> measures = ["meas2", "meas3", "meas4"]
        >>> weights = [0.5, 0.2, 0.3]
        >>> calc_wa_ignore_nan(df, measures, weights)
        array([2.28571429, 1.6])

    """
    assert not df.empty, "Nothing to calculate weighted average for: `df` is empty."
    # Need to coerce type to np.float64 instead of python's float
    # to avoid "ufunc 'isnan' not supported for the input types ..." error
    data = np.array(df[measures].values, dtype=np.float64)

    # Make a 2d array with the same weights for each row
    # cast for safety and better errors
    weights = np.array([weights, ] * data.shape[0], dtype=np.float64)

    mask = np.isnan(data)
    masked_data = np.ma.masked_array(data, mask=mask)
    masked_weights = np.ma.masked_array(weights, mask=mask)

    # np.nanmean doesn't support weights
    weighted_avgs = np.average(masked_data, weights=masked_weights, axis=1)
    # Replace masked elements with np.nan
    # otherwise those elements will be interpreted as 0 when read into a pd.DataFrame
    weighted_avgs = weighted_avgs.filled(np.nan)

    return weighted_avgs

Upvotes: 2

Aleksandr Tukallo

Reputation: 1427

I would offer another solution that scales better to more dimensions (e.g. when averaging over different axes). The code below works with a 2D array that may contain nans, and takes the average over axis=0.

import numpy as np

a = np.random.randint(5, size=(3,2)).astype(float) # let's generate some random 2D array
a[0, 1] = np.nan # put a nan in it (this needs a float array)

# make weights matrix with zero weights at nan's in a
w_vec = np.arange(1, a.shape[0]+1)
w_vec = w_vec.reshape(-1, 1)
w_mtx = np.repeat(w_vec, a.shape[1], axis=1)
w_mtx *= (~np.isnan(a))

# take average as (weighted_elements_sum / weights_sum)
w_a = a * w_mtx
a_sum_vec = np.nansum(w_a, axis=0)
w_sum_vec = np.nansum(w_mtx, axis=0)
mean_vec = a_sum_vec / w_sum_vec

# mean_vec is vector with weighted nan-averages of array a taken along axis=0
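
The steps above can be folded into a function that takes the axis as an argument. This is a sketch rather than a tested library routine: the name `nan_weighted_mean` is made up, and it assumes `weights` is 1-D with one weight per entry along `axis`.

```python
import numpy as np

def nan_weighted_mean(a, weights, axis=0):
    """Weighted mean along `axis`, ignoring nans in `a` (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # reshape the 1-D weights so they broadcast along `axis`
    shape = [1] * a.ndim
    shape[axis] = -1
    w = np.broadcast_to(weights.reshape(shape), a.shape)
    # zero out both the value and the weight wherever `a` is nan
    valid = ~np.isnan(a)
    num = np.where(valid, a * w, 0.0).sum(axis=axis)
    den = np.where(valid, w, 0.0).sum(axis=axis)
    # slices with no valid data (den == 0) come out as nan
    with np.errstate(invalid="ignore", divide="ignore"):
        return num / den

a = np.array([[1.0, np.nan],
              [2.0, 5.0],
              [np.nan, 7.0]])
print(nan_weighted_mean(a, [1, 2, 3], axis=0))
```

For this array the first column averages to (1*1 + 2*2) / (1 + 2) and the second to (5*2 + 7*3) / (2 + 3).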

Upvotes: 3

Nicolas Barbey

Reputation: 6797

Alternatively, you can use a MaskedArray like this:

>>> import numpy as np

>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> ma = np.ma.MaskedArray(a, mask=np.isnan(a))
>>> np.ma.average(ma, weights=weights)
1.75
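
The masked-array route also extends to 2-D data with an axis argument. A small sketch with made-up data, using np.ma.masked_invalid to build the mask:

```python
import numpy as np

a = np.array([[1.0, 2.0, np.nan, 4.0],
              [2.0, np.nan, 6.0, 8.0]])
weights = np.array([4, 3, 2, 1])

# masked_invalid masks nan (and inf) entries
ma = np.ma.masked_invalid(a)

# one weighted average per row; masked entries and their weights are skipped
row_avgs = np.ma.average(ma, weights=weights, axis=1)
print(row_avgs)
```

For this data the two row averages work out to 1.75 and 4.0. The result is itself a masked array, so calling .filled(np.nan) on it is one way to get a plain ndarray back.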

Upvotes: 20
