Reputation: 35
I'm getting this ValueError: array must not contain infs or NaNs even after I used np.nan_to_num().
As you can see in the code block below, I wrote a function that puts sentiment scores through the logit function to get a normal distribution. I got the ValueError above at first, so I decided to try using np.nan_to_num() on the sentiment scores before passing them into the pearsonr function. However, even after that it still raises the error. When I use logit and np.nan_to_num() in a separate unit test and print the output, there are no inf or nan values (note: there were some infs before using np.nan_to_num(), caused by the logit function, but they were all successfully replaced in the separate unit test that was printed). I'm not sure what's going on in the loop below, though. For reference, new_all_data is just a pandas DataFrame that I used to store all the various pieces of information. Any help is appreciated!
from scipy import special, stats
import numpy as np

def logit_correlate(sentiments, percent_changes):
    normalised_sentiments = special.logit(sentiments)
    normalised_sentiments_fixed = np.nan_to_num(normalised_sentiments)
    r, p_value = stats.pearsonr(normalised_sentiments_fixed, percent_changes)
    return r, p_value

for sentiment in ['weighted_sentiments', 'unweighted_sentiments', 'weighted_sentiments_DBSCAN', 'full_mean', 'full_mean_unrounded', 'full_mode']:
    r_1yr, p_value_1yr = logit_correlate(new_all_data[sentiment], new_all_data['% change 1 year'])
    print(f'{sentiment:10} 1 year r={r_1yr:10} p={p_value_1yr:10}')
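For context, here is a minimal standalone sketch (with made-up scores, just to illustrate the behaviour I described) of logit producing infs at the boundaries and np.nan_to_num replacing them, which is what my separate unit test showed:

```python
import numpy as np
from scipy import special

# Hypothetical sentiment scores; 0.0 and 1.0 are the problem cases.
scores = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

out = special.logit(scores)   # logit(0) == -inf, logit(1) == +inf
fixed = np.nan_to_num(out)    # infs replaced by +/- float max

print(np.isinf(out).any())    # True
print(np.isinf(fixed).any())  # False
```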
Edit: I ran another test to make sure I wasn't losing my mind. Here is the modified logit_correlate function:
def logit_correlate(sentiments, percent_changes):
    normalised_sentiments = special.logit(sentiments)
    normalised_sentiments_fixed = np.nan_to_num(normalised_sentiments)
    print(np.isnan(normalised_sentiments_fixed).any())
    print(np.isinf(normalised_sentiments_fixed).any())
    print(np.isnan(percent_changes.values).any())
    print(np.isinf(percent_changes.values).any())
    r, p_value = stats.pearsonr(normalised_sentiments_fixed, percent_changes)
    return r, p_value
As you can see, I checked to make sure there were no nan or inf values in either of the input arrays for the Pearson correlation (pearsonr). Everything returned False. The function breaks when pearsonr is called; both the logit and np.nan_to_num() calls run fine (I printed their output as well just to make sure). And yet it is still raising the ValueError.
Upvotes: 1
Views: 4273
Reputation: 36249
np.nan_to_num replaces inf values with some very large number, which can cause additional infs later on. scipy.stats.pearsonr computes the mean of the input arrays, and already two infs in the original array are enough to produce a new inf, because the sum of those large replacement numbers exceeds the valid floating point range:
>>> np.nan_to_num(np.array([np.inf, np.inf])).mean()
inf
Depending on your needs you could either drop them completely via x[np.isfinite(x)] or use the posinf and neginf keyword arguments of nan_to_num in order to replace the infs with smaller values.
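A minimal sketch of both options, assuming NumPy >= 1.17 (when the posinf/neginf keywords were added); the clip values +/-10.0 and the sample data are arbitrary choices for illustration:

```python
import numpy as np
from scipy import special, stats

rng = np.random.default_rng(0)
sentiments = rng.uniform(0.05, 0.95, 50)
sentiments[:2] = 1.0          # force logit(1) == +inf, the failure case
percent_changes = rng.normal(size=50)

logits = special.logit(sentiments)

# Option 1: drop non-finite values, keeping both arrays aligned.
mask = np.isfinite(logits)
r1, p1 = stats.pearsonr(logits[mask], percent_changes[mask])

# Option 2: clip infs to finite bounds instead of float max (NumPy >= 1.17).
fixed = np.nan_to_num(logits, posinf=10.0, neginf=-10.0)
r2, p2 = stats.pearsonr(fixed, percent_changes)

print(r1, p1)
print(r2, p2)
```

Note that option 1 changes the sample size while option 2 keeps all points but distorts the extreme values, so the resulting correlations will differ slightly.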
Upvotes: 2