Reputation: 1948
I have a data frame that contains 50 rows, for example the BCI data from R.
import pandas.rpy.common as com
varespec = com.load_data('BCI', 'vegan')
I am attempting to apply a function to each row, where the function takes a 'size' argument.
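import numpy as np
from scipy.misc import comb

def rare(y, size):
    notabs = ~np.isnan(y)   # mask of non-missing counts
    t = y[notabs]
    N = np.sum(t)           # total count for this row
    diff = N - t
    rare = np.sum(1 - comb(diff, size)/comb(N, size))
    return rare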
If size is an integer, it works fine:
varespec.apply(rare, axis=1, args=(20,))
What I would like to do is make size an array of 50 elements that all differ, so that each row has a unique value of size. If I make size a vector of 50, it passes the entire vector and the function doesn't work. How can I make
varespec.apply(rare, axis=1, args=(size,))
use a unique element of size for each row? I can do for loops:
for i in xrange(50):
    rare(varespec.iloc[i,:], size[i])
but is there a better way using apply functions?
Upvotes: 0
Views: 95
Reputation: 879103
You could express the result as a calculation on whole NumPy arrays, rather than one done by calling rare once for each row of varespec:
import pandas as pd
import pandas.rpy.common as com
import scipy.misc as misc
import numpy as np
np.random.seed(1)

def rare(y, size):
    notabs = ~np.isnan(y)
    t = y[notabs]
    N = np.sum(t)
    diff = N - t
    rare = np.sum(1 - misc.comb(diff, size)/misc.comb(N, size))
    return rare

def using_rare(size):
    # one call to rare per row, each with its own size
    return np.array([rare(varespec.iloc[i,:], size[i]) for i in xrange(50)])

def using_arrays(size):
    # row totals as a plain array of shape (50,)
    N = varespec.sum(axis='columns', skipna=True).values
    # transpose so the last axis has length 50 and lines up with size
    diff = (N[:, np.newaxis] - varespec.values).T
    return np.sum(1 - misc.comb(diff, size) / misc.comb(N, size), axis=0)

varespec = com.load_data('BCI', 'vegan')
size = np.random.randint(varespec.shape[1], size=(varespec.shape[0],))
This shows using_rare and using_arrays produce the same result:
expected = using_rare(size)
result = using_arrays(size)
assert np.allclose(result, expected)
In [229]: %timeit using_rare(size)
10 loops, best of 3: 36.2 ms per loop
In [230]: %timeit using_arrays(size)
100 loops, best of 3: 2.89 ms per loop
This takes advantage of the fact that scipy.misc.comb can accept NumPy arrays as input. So you can call comb(diff, size) where diff is an array of shape (225, 50) and size is an array of shape (50,). Since size is only used in the calls to comb, it is possible to perform all the calculations with just two calls to comb. No looping per row required.
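The broadcasting can be checked on a small made-up array. This is only a sketch: the counts and size values below are invented stand-ins for the BCI data, and comb is imported from scipy.special because scipy.misc.comb is no longer available in recent SciPy releases.

```python
import numpy as np
from scipy.special import comb  # assumption: a SciPy version where comb lives in scipy.special

# Hypothetical data: 3 rows (sites) x 5 columns (species), one size per row.
counts = np.array([[3, 0, 5, 2, 1],
                   [1, 1, 1, 1, 1],
                   [0, 7, 2, 0, 4]], dtype=float)
size = np.array([2, 3, 4])

N = counts.sum(axis=1)                  # row totals, shape (3,)
diff = (N[:, np.newaxis] - counts).T    # shape (5, 3): last axis matches size

# size and N (both shape (3,)) broadcast against diff's last axis,
# so column i of diff is paired with size[i] and N[i].
vectorized = np.sum(1 - comb(diff, size) / comb(N, size), axis=0)

# The same calculation done one row at a time, for comparison.
looped = np.array([
    np.sum(1 - comb(N[i] - counts[i], size[i]) / comb(N[i], size[i]))
    for i in range(len(size))
])
```

The transpose is what makes this work: broadcasting aligns trailing axes, so the per-row values must sit on the last axis of diff.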
Upvotes: 1
Reputation: 11734
You can add that vector as a column to your data frame (remove it later if you wish):
varespec['size'] = size
And then either change your rare function:
def rare(x):
    size = x['size']
    y = x.values[:-1]
    ...
Or if you don't want to change rare, wrap it:
def rare_wrapper(x):
    size = x['size']
    y = x.values[:-1]
    return rare(y, size)
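Put together, the wrapper approach might look like the sketch below. The small counts frame and the size values are invented for illustration (they are not the BCI data), comb is taken from scipy.special as an assumption about the installed SciPy, and rare is condensed slightly. The size column is added to a copy so the original frame stays unchanged.

```python
import numpy as np
import pandas as pd
from scipy.special import comb  # assumption: comb imported from scipy.special

def rare(y, size):
    y = np.asarray(y, dtype=float)
    t = y[~np.isnan(y)]          # drop missing counts
    N = np.sum(t)
    return np.sum(1 - comb(N - t, size) / comb(N, size))

def rare_wrapper(x):
    size = x['size']
    y = x.values[:-1]            # relies on 'size' being the last column
    return rare(y, size)

# Hypothetical stand-in for the real data: 4 rows x 5 species.
counts = pd.DataFrame(
    [[3, 0, 5, 2, 1],
     [1, 1, 1, 1, 1],
     [0, 7, 2, 0, 4],
     [6, 2, 2, 3, 0]],
    columns=list('abcde'))
size = np.array([2, 3, 4, 5])

with_size = counts.copy()
with_size['size'] = size         # each row now carries its own size
result = with_size.apply(rare_wrapper, axis=1)
```

Because x.values[:-1] strips the last element by position, the size column has to be appended last for this to work.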
Upvotes: 0