Nate

Reputation: 1948

Pandas apply with argument that varies by row

I have a data frame that contains 50 rows, for example the BCI data from R.

import numpy as np
import pandas.rpy.common as com
from scipy.misc import comb
varespec = com.load_data('BCI', 'vegan')

I am attempting to apply a function to each row, where the function takes a 'size' argument.

def rare(y, size):
    notabs = ~np.isnan(y)
    t = y[notabs]
    N = np.sum(t)
    diff = N - t
    rare = np.sum(1 - comb(diff, size)/comb(N, size))
    return rare

If size is an integer, it works fine:

varespec.apply(rare, axis=1, args=(20,))

What I would like to do is make size an array of 50 elements that all differ, so that each row gets its own value of size. If I pass size as a vector of 50, apply hands the entire vector to every call and the function doesn't work. How can I make

varespec.apply(rare, axis=1, args=(size,))

use a different element of size for each row? I can write a for loop:

for i in xrange(50):
    rare(varespec.iloc[i,:], size[i])

but is there a better way using apply functions?

Upvotes: 0

Views: 95

Answers (2)

unutbu

Reputation: 879103

You could express the result as a calculation on whole NumPy arrays instead of calling rare once for each row of varespec:

import pandas as pd
import pandas.rpy.common as com
import scipy.misc as misc
import numpy as np
np.random.seed(1)

def rare(y, size):
    notabs = ~np.isnan(y)
    t = y[notabs]
    N = np.sum(t)
    diff = N - t
    rare = np.sum(1 - misc.comb(diff, size)/misc.comb(N, size))
    return rare

def using_rare(size):
    return np.array([rare(varespec.iloc[i,:], size[i]) for i in xrange(50)])

def using_arrays(size):    
    N = varespec.sum(axis='columns', skipna=True)
    diff = (N[:, np.newaxis] - varespec.values).T
    return np.sum(1 - misc.comb(diff, size) / misc.comb(N, size), axis=0)

varespec = com.load_data('BCI', 'vegan')
size = np.random.randint(varespec.shape[1], size=(varespec.shape[0],))

This shows using_rare and using_arrays produce the same result:

expected = using_rare(size)
result = using_arrays(size)
assert np.allclose(result, expected)

using_arrays is also roughly 12 times faster on this data:

In [229]: %timeit using_rare(size)
10 loops, best of 3: 36.2 ms per loop

In [230]: %timeit using_arrays(size)
100 loops, best of 3: 2.89 ms per loop

This takes advantage of the fact that scipy.misc.comb accepts NumPy arrays as input and broadcasts them against each other. So you can call comb(diff, size) where diff is an array of shape (225, 50) and size is an array of shape (50,); column j of diff is then evaluated with size[j]. Since size is only used in the calls to comb, all the calculations can be done with just two calls to comb, with no per-row loop.
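To see the broadcasting on something small, here is a rough sketch that does not need the BCI data. The counts table and the size values are made up for illustration, and I am assuming scipy.special.comb as a stand-in for scipy.misc.comb, since that is where comb lives in current SciPy releases.

```python
import numpy as np
from scipy.special import comb  # assumption: replaces the older scipy.misc.comb

# Hypothetical stand-in for the BCI table: 4 rows (sites) x 3 columns (species).
counts = np.array([[5., 3., 2.],
                   [1., 0., 9.],
                   [4., 4., 4.],
                   [7., 2., 1.]])
size = np.array([2, 3, 4, 5])          # one sample size per row

N = counts.sum(axis=1)                 # shape (4,): row totals
diff = (N[:, np.newaxis] - counts).T   # shape (3, 4): species x sites

# comb broadcasts (3, 4) against (4,), so column j is evaluated with size[j].
vectorized = np.sum(1 - comb(diff, size) / comb(N, size), axis=0)

# Row-by-row reference, equivalent to looping over rare(row, size[i]).
looped = np.array([
    np.sum(1 - comb(N[i] - counts[i], size[i]) / comb(N[i], size[i]))
    for i in range(len(size))
])
```

The transpose is what lines the trailing axis of diff up with size; without it the shapes (4, 3) and (4,) would not broadcast the way you want.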

Upvotes: 1

tktk

Reputation: 11734

You can add that vector as a column to your data frame (remove it later if you wish):

varespec['size'] = size

And then either change your rare function:

def rare(x):
    size = x['size']
    y = x.values[:-1]
    ...

Or if you don't want to change rare, wrap it:

def rare_wrapper(x):
    size = x['size']
    y = x.values[:-1]  # everything except the trailing 'size' entry
    return rare(y, size)
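Either way you then call apply with axis=1 and no extra args. Here is a minimal end-to-end sketch of the wrapper version; the small frame df and the sizes array are invented for illustration, and scipy.special.comb is assumed in place of the older scipy.misc.comb.

```python
import numpy as np
import pandas as pd
from scipy.special import comb  # assumption: replaces the older scipy.misc.comb

def rare(y, size):
    # Same calculation as the question's rare, for a plain array y.
    t = y[~np.isnan(y)]
    N = np.sum(t)
    return np.sum(1 - comb(N - t, size) / comb(N, size))

def rare_wrapper(x):
    # x is one row; its last entry is the per-row size added below.
    return rare(x.values[:-1], x['size'])

# Hypothetical stand-in for varespec.
df = pd.DataFrame([[5., 3., 2.], [1., 0., 9.], [4., 4., 4.]],
                  columns=['a', 'b', 'c'])
sizes = np.array([2, 3, 4])

with_size = df.copy()            # copy so the original frame is left untouched
with_size['size'] = sizes        # must be the last column for x.values[:-1]
result = with_size.apply(rare_wrapper, axis=1)
```

Working on a copy saves you from having to remove the column afterwards. Note that this relies on 'size' being the last column, so add it after any other columns.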

Upvotes: 0
