Reputation: 984
I am trying to get a geometric mean in rows, the DataFrame looks like this:
PLTRNSU00013 PLSRBEX00014 PLATTFI00018 PLALMTL00023 PLAMBRA00013 PLAMICA00010 PLAMPLI00019 NL0000474351 PLAPATR00018 PLAPLS000016
2010-07-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2010-10-01 NaN NaN NaN NaN NaN NaN NaN 0.968237 NaN NaN
2011-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-04-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-07-01 0.979871 NaN NaN NaN NaN NaN NaN 1.00999 NaN NaN
2011-10-01 NaN NaN NaN NaN NaN 1.00737 NaN NaN NaN NaN
2012-01-01 NaN NaN NaN NaN NaN NaN NaN NaN 1.05766 NaN
2012-04-01 NaN NaN NaN NaN NaN NaN NaN NaN 0.979955 NaN
2012-07-01 NaN NaN NaN NaN NaN NaN NaN NaN 1.01718 NaN
2012-10-01 NaN 0.916302 NaN NaN NaN NaN NaN 0.979858 NaN NaN
I did not find any pandas buildin method, so I was using the gmean from scipy
from scipy.stats.mstats import gmean
but when called on on row:
In [285]:gmean(DataFrame.loc['2015-10-01'])
Traceback (most recent call last):
File "<ipython-input-28-e9186c65a04d>", line 1, in <module>
gmean(DataFrame.loc['2015-10-01'])
File "D:\Python\lib\site-packages\scipy\stats\stats.py", line 305, in gmean
log_a = np.log(np.array(a, dtype=dtype))
AttributeError: 'float' object has no attribute 'log'
I am getting and exception AttributeError
I tried getting rid of the Nan
with
In [287]: gmean(DataFrame.loc['2015-10-01'].dropna())
Traceback (most recent call last):
File "<ipython-input-29-e8807696d6be>", line 1, in <module>
gmean(DataFrame.loc['2015-10-01'].dropna())
File "D:\Python\lib\site-packages\scipy\stats\stats.py", line 305, in gmean
log_a = np.log(np.array(a, dtype=dtype))
AttributeError: 'numpy.float64' object has no attribute 'log'
I can do it manually with math
pow
method, but offcourse it is very inefficient, it works only with scalars.
Upvotes: 5
Views: 15140
Reputation: 180
from scipy.stats import gmean
df["gmean"] = df.apply(gmean, axis=1)
or if you have NaN
df['gmean'] = df.apply(lambda row: gmean(row.dropna()), axis=1)
Upvotes: 10
Reputation: 49
Looking at the scipy gmean code, it implements gmean(a)
as np.exp(np.average(np.log(a),weights, axis))
. This has been noted by other posters here.
Since both np.exp()
and np.log()
return dataframes if fed a dataframe, we can let pandas handle nans for us and do:
np.exp(np.log(a).mean(axis=1))
.
Upvotes: 0
Reputation: 71
The geometric mean is the nth root of the product of the series. That means it is also exp(mean of logs of values), like so using numpy:
df.apply(lambda x: np.exp(np.mean(np.log(x))), axis = 1)
Output
2010-07-01 NaN
2010-10-01 0.968237
2011-01-01 NaN
2011-04-01 NaN
2011-07-01 0.994817
2011-10-01 1.007370
2012-01-01 1.057660
2012-04-01 0.979955
2012-07-01 1.017180
2012-10-01 0.947547
dtype: float64
Upvotes: 3
Reputation: 4521
Not sure, why gmean didn't work for you. Have you already checked the datatypes of your columns? are they probably not np.float32 or np.float64?
A solution like proposed by Quang Hoang could work, but I guess only if the product does not produce an overflow. I immagine the gmean takes the logs, sums the results up, builds the average and returns e^(average).
You could of course do this by hand, if you like. Dropping the NAs before (or using an indexer that just ignores them).
So it would be something like:
import numpy as np
values= your_df.loc[your_id]
indexer= ~values.isna()
avg_log=values[indexer].map(np.log).mean()
np.exp(avg_log)
If you use this method, make sure to use at least np.float64 for the log and mean opreations.
Upvotes: 1
Reputation: 150825
We can do it manually with numpy
:
np.exp(np.log(df.prod(axis=1))/df.notna().sum(1))
Output:
2010-07-01 NaN
2010-10-01 0.968237
2011-01-01 NaN
2011-04-01 NaN
2011-07-01 0.994817
2011-10-01 1.007370
2012-01-01 1.057660
2012-04-01 0.979955
2012-07-01 1.017180
2012-10-01 0.947547
dtype: float64
Upvotes: 9