Kevin

Reputation: 35

Getting `ValueError: array must not contain infs or NaNs` even after using `np.nan_to_num()`

I'm getting `ValueError: array must not contain infs or NaNs` even after I used `np.nan_to_num()`.

As you can see in the code block below, I wrote a function that passes sentiment scores through the logit function to get a normal distribution. I got the ValueError above at first, so I tried applying np.nan_to_num() to the sentiment scores before feeding them into the pearsonr function. However, even after that it still raises the error. When I use logit and np.nan_to_num() in a separate unit test and print the output, there are no inf or nan values (note: there were some infs before applying np.nan_to_num(), caused by the logit function, but they were all successfully replaced in that separate unit test). I'm not sure what's going wrong in the loop below. For reference, new_all_data is just a pandas DataFrame that I use to store all the various pieces of information. Any help is appreciated!

from scipy import special, stats
import numpy as np

def logit_correlate(sentiments, percent_changes):
    normalised_sentiments = special.logit(sentiments)
    normalised_sentiments_fixed = np.nan_to_num(normalised_sentiments)
    r, p_value = stats.pearsonr(normalised_sentiments_fixed, percent_changes)
    return r, p_value

for sentiment in ['weighted_sentiments', 'unweighted_sentiments', 'weighted_sentiments_DBSCAN', 'full_mean', 'full_mean_unrounded', 'full_mode']:
    r_1yr, p_value_1yr = logit_correlate(new_all_data[sentiment], new_all_data['% change 1 year'])
    print(f'{sentiment:10} 1 year r={r_1yr:10} p={p_value_1yr:10}')

Edit: I ran another test to make sure I wasn't losing my mind. Here is the modified logit_correlate function:

def logit_correlate(sentiments, percent_changes):
    normalised_sentiments = special.logit(sentiments)
    normalised_sentiments_fixed = np.nan_to_num(normalised_sentiments)
    print(np.isnan(normalised_sentiments_fixed).any())
    print(np.isinf(normalised_sentiments_fixed).any())
    print(np.isnan(percent_changes.values).any())
    print(np.isinf(percent_changes.values).any())
    r, p_value = stats.pearsonr(normalised_sentiments_fixed, percent_changes)
    return r, p_value

As you can see, I checked that there were no nan or inf values in either of the input arrays for the Pearson correlation (pearsonr). Every check returned False. Both logit and np.nan_to_num() run without issue (I printed their output as well just to make sure), and the function only breaks when pearsonr is called. And yet it is still raising the ValueError.

Upvotes: 1

Views: 4273

Answers (1)

a_guest

Reputation: 36249

np.nan_to_num replaces inf values with a very large finite number, which can produce new infs later on. scipy.stats.pearsonr computes the mean of the input arrays, and just two infs in the original array are enough to reintroduce an inf, because the sum of those huge replacement values exceeds the valid floating-point range:

>>> np.nan_to_num(np.array([np.inf, np.inf])).mean()
inf
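This also explains why the isinf checks in the question return False: the replacement value is finite (by default, the largest representable float64), so it passes the check but still overflows once summed. A quick sketch:

```python
import numpy as np

x = np.nan_to_num(np.array([np.inf, np.inf]))

# By default, +inf is replaced with the largest representable float64,
# which is finite, so np.isinf(x) is all False...
assert x[0] == np.finfo(np.float64).max
assert not np.isinf(x).any()

# ...but adding two of them exceeds the float64 range and overflows
# back to inf, which is exactly what happens inside pearsonr's mean.
print(x.sum())  # inf
```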

Depending on your needs you could either drop the non-finite values completely via x[np.isfinite(x)] (applying the same mask to the other array so the pairs stay aligned) or use the posinf and neginf keyword arguments of nan_to_num in order to replace the infs with smaller finite values.
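A minimal sketch of both options, using made-up sentiment data (scores of exactly 0 or 1 are what make logit produce -inf/+inf; the posinf/neginf keywords require NumPy >= 1.17, and the cap values of +/-10 are an arbitrary choice):

```python
import numpy as np
from scipy import special, stats

# Hypothetical data: the 1.0 and 0.0 entries become +inf/-inf under logit.
sentiments = np.array([0.1, 0.4, 1.0, 0.0, 0.8, 0.6])
changes = np.array([1.2, -0.5, 2.0, -1.1, 0.7, 0.3])

logits = special.logit(sentiments)

# Option 1: drop non-finite pairs, masking both arrays identically.
mask = np.isfinite(logits)
r_drop, p_drop = stats.pearsonr(logits[mask], changes[mask])

# Option 2: cap the infinities at moderate finite values instead of
# the ~1.8e308 default, so later sums cannot overflow.
capped = np.nan_to_num(logits, posinf=10.0, neginf=-10.0)
r_cap, p_cap = stats.pearsonr(capped, changes)
```

Which option is appropriate depends on whether the 0/1 scores carry real information (cap them) or are artifacts you'd rather exclude (drop them).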

Upvotes: 2
