Reputation: 261
I'm trying to implement some machine learning algorithms, but I'm having some difficulties putting the data together.
In the example below, I load an example dataset from UCI, remove the rows with missing data (thanks to the help from a previous question), and now I would like to normalize the data.
For many datasets, I just used:
valores = (valores - valores.mean()) / (valores.std())
But for this particular dataset the approach above doesn't work: the mean function returns inf, perhaps due to a precision issue. See the example below:
import pandas as pd

bcw = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)
for col in bcw.columns:
    if bcw[col].dtype != 'int64':
        print "Removing possible '?' in column %s..." % col
        bcw = bcw[bcw[col] != '?']
valores = bcw.iloc[:, 1:10]

# mean returns inf
print valores.iloc[:, 5].mean()
My question is how to deal with this. It seems that I need to change the type of this column, but I don't know how to do it.
Upvotes: 6
Views: 26345
Reputation: 2279
The reason you are getting inf values can stem from multiple sources.
Overflow: As others have mentioned, this could be due to an overflow. If you're unfamiliar with this concept, you can read more about it on Wikipedia: Integer overflow. Essentially, computing statistics like the mean or standard deviation often involves summing all values in your dataset, which can lead to very large numbers and potential overflow.
NaN or inf Values: Another common issue is the presence of NaN or inf values in your DataFrame. These values can disrupt your calculations. To handle this, you can use a simple and fast trick to replace all infinity values with NaNs:
df.replace([np.inf, -np.inf], np.nan, inplace=True)
# Then compute:
(df - df.mean()) / df.std()
This works because NaNs are naturally ignored when computing statistics in pandas, whereas inf values are not. If you still encounter NaN values after this, it might be because your series consists entirely of NaNs. To check for this scenario, you can use the following function:
def is_only_nan_or_inf(df):
    return df.isna().all(axis=0).any() or np.isposinf(df).all(axis=0).any()
It's also possible that your dataset contains rows full of NaNs. In such cases, you should drop these rows before performing any computations:
df.dropna(subset=["col1", "col2"], how="all", inplace=True)
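Putting the pieces above together, here is a minimal sketch of the replace-then-standardize flow, using a small made-up frame rather than the question's data:

```python
import numpy as np
import pandas as pd

# Toy frame with one inf and one NaN (illustrative values only)
df = pd.DataFrame({"a": [1.0, 2.0, np.inf], "b": [4.0, np.nan, 6.0]})

# inf would poison the mean; turn it into NaN so pandas skips it
df.replace([np.inf, -np.inf], np.nan, inplace=True)

print(df["a"].mean())  # 1.5 -- the former inf is now NaN and ignored

# Standardize; cells that were NaN stay NaN in the result
z = (df - df.mean()) / df.std()
print(z)
```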
Upvotes: 0
Reputation: 10355
For me, the reason was an overflow: my original data was in float16, and calling .mean() on it returned inf. After converting the data to float32 (e.g. via .astype("float32")), .mean() worked as expected.
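A quick way to reproduce this overflow with made-up numbers: 1000 values of 100.0 sum to 100000, which is past float16's maximum (~65504), so a float16 accumulator blows up to inf.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.full(1000, 100.0, dtype=np.float16))

# numpy keeps a float16 accumulator for .sum() on float16 input,
# so the running total overflows
print(np.full(1000, 100.0, dtype=np.float16).sum())  # inf

print(s.mean())  # may also be inf, depending on the pandas version

# Widening the dtype first gives the correct answer
print(s.astype("float32").mean())  # 100.0
```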
Upvotes: 1
Reputation: 14611
If the elements of the pandas Series are strings, you get inf as the mean result. In this specific case you can simply convert the Series elements to float and then calculate the mean. No need to use numpy.
Example:
valores.iloc[:,5].astype(float).mean()
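For instance, with a small object-dtype Series of digit strings (made-up values, standing in for the UCI column after the '?' rows are dropped):

```python
import pandas as pd

# Numeric values stored as strings, i.e. object dtype
s = pd.Series(["1", "10", "4"], dtype=object)

print(s.astype(float).mean())  # 5.0
```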
Upvotes: 5
Reputation: 329
I had the same problem with a column that was of dtype 'O' and whose max value was 9999. Have you tried using the convert_objects method with the convert_numeric=True parameter? That fixed the problem for me.
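Note that convert_objects was deprecated and later removed from pandas; pd.to_numeric is the modern way to do the same conversion. A sketch with made-up values:

```python
import pandas as pd

# Object-dtype column of numeric strings, as in the dtype 'O' case above
s = pd.Series(["1", "9999", "3"], dtype=object)

# errors="coerce" turns non-numeric entries (e.g. '?') into NaN
nums = pd.to_numeric(s, errors="coerce")
print(nums.mean())
```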
Upvotes: 1
Reputation: 74172
NaN values should not matter when computing the mean of a pandas.Series. Precision is also irrelevant. The only explanation I can think of is that one of the values in valores is equal to infinity.
You could exclude any values that are infinite when computing the mean like this:
import numpy as np
is_inf = valores.iloc[:, 5] == np.inf
valores.iloc[:, 5][~is_inf].mean()
Upvotes: 3
Reputation: 1208
Not so familiar with pandas, but if you convert to a numpy array it works. Try:
np.asarray(valores.iloc[:, 5], dtype=float).mean()
Upvotes: 5