williaster

Reputation: 93

Using plyr (ddply) with rpy2 syntax

As a learning exercise and because I'd like to do something similar with my own data, I'm trying to copy the answer to this example exactly but implement it in Python via rpy2.

This is turning out to be trickier than I thought, because plyr uses a lot of convenient syntax (e.g. as.quoted variables, summarize, functions) that I haven't found easy to port to rpy2. Without even getting to the ggplot2 segment, this is what I've been able to manage so far, using **{} to pass the arguments whose names start with '.':

import rpy2.robjects as ro
from rpy2.robjects.packages import importr
stats = importr('stats')
plyr = importr('plyr')
bs = importr('base')
r = ro.r
df = ro.DataFrame

mms = df( {'delicious': stats.rnorm(100), 
           'type':bs.sample(bs.as_factor(ro.StrVector(['peanut','regular'])), 100, replace=True),
           'color':bs.sample(bs.as_factor(ro.StrVector(['r','g','y','b'])), 100, replace=True)} )

# first define a function, then use it in ddply call
myfunc  = r('''myfunc <- function(var) {paste('n =', length(var))} ''')
mms_cor = plyr.ddply(**{'.data':mms, 
                        '.variables':ro.StrVector(['type','color']), 
                        '.fun':myfunc})

This runs without error, but printing the resulting mms_cor gives the following, which suggests the function isn't doing what I intended inside the ddply call. I think it is computing the length of the mms data.frame, which is 3 (its number of columns), because other inputs to myfunc return different values:

     type color    V1
1  peanut     b n = 3
2  peanut     g n = 3
3  peanut     r n = 3
4  peanut     y n = 3
5 regular     b n = 3
6 regular     g n = 3
7 regular     r n = 3
8 regular     y n = 3 

Ideally I would get this to work with summarize, as done in the example answer, to have multiple calculations/label the output, but I couldn't get this to work either, and it really becomes awkward syntax-wise:

mms_cor = plyr.ddply(plyr.summarize, n=bs.paste('n =', bs.length('delicious')), 
                     **{'.data':mms,'.variables':ro.StrVector(['type','color'])})

This gives the same output as above with 'n = 1'. I know it's reflecting the length of the 1-item vector 'delicious', but can't figure out how to make this a variable instead of a string, or which variable it would be (which is why I moved toward the function above). Additionally, it would be useful to know how one might get the as.quoted variable syntax (e.g. ddply(.data=mms, .(type, color), ...)) to work with rpy2. I know plyr has several as_quoted methods, but I can't figure out how to use them because documentation and examples are tricky to find.

Any help is greatly appreciated. Thanks.

Edit:

Following lgautier's solution, myfunc is fixed by using nrow instead of length:

myfunc = r('''myfunc <- function(var) {paste('n =', nrow(var))} ''')
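To illustrate why length gave 'n = 3' while nrow gives the group size: in R, length() of a data.frame counts its columns, and nrow() counts its rows. Below is a minimal pure-Python sketch of that difference. It does not use rpy2 or R; the dict `mms_like` and the helpers `n_columns` and `n_rows` are hypothetical stand-ins invented for this illustration.

```python
# Pure-Python stand-in for an R data.frame: column name -> list of values.
# Illustration only; `mms_like`, `n_columns` and `n_rows` are made-up names.
mms_like = {
    "delicious": [0.1, -0.4, 1.2, 0.7, -0.9],
    "type": ["peanut", "regular", "peanut", "peanut", "regular"],
    "color": ["r", "g", "y", "b", "r"],
}


def n_columns(frame):
    """Analogue of R's length(df): the number of columns."""
    return len(frame)


def n_rows(frame):
    """Analogue of R's nrow(df): the length of the first column (0 if none)."""
    if not frame:
        return 0
    first_column = next(iter(frame.values()))
    return len(first_column)


# Same label format as myfunc, built from each count.
label_from_length = "n = %d" % n_columns(mms_like)  # counts columns
label_from_nrow = "n = %d" % n_rows(mms_like)       # counts rows

print(label_from_length)
print(label_from_nrow)
```

The column count is the same for every subset ddply passes in, which is consistent with every group showing the same label in the output above.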

Solution for ggplot2, in case it is useful for others. Note that I had to add x and y values to mms_cor as a workaround so that I could use aes_string, since I can't get aes to work in the Python environment:

rggplot2 = importr('ggplot2')  # note ggplot2 import above doesn't take 'mapping' kwarg
p = rggplot2.ggplot(data=mms, mapping=rggplot2.aes_string(x='delicious')) + \
    rggplot2.geom_density() + \
    rggplot2.facet_grid('type ~ color') + \
    rggplot2.geom_text(data=mms_cor, mapping=rggplot2.aes_string(x='x', y='y', label='V1'), colour='black', inherit_aes=False)

p.plot()

Upvotes: 2

Views: 428

Answers (1)

lgautier

Reputation: 11565

Since you are using a callback, I can't resist showing one of the unexpected things rpy2 can do: an ordinary Python function can serve as the R callback (note: the code is untested, there might be typos):

def myfunc(var):
    # var is a data.frame, the length of
    # the first vector is the number of rows
    if len(var) == 0:
        nr = 0
    else:
        nr = len(var[0])
    # any string format feature in Python could
    # be used here
    return 'n = %i' % nr 

# create R function from the Python function
from rpy2.rinterface import rternalize
myfunc_r = rternalize(myfunc)

mms_cor = plyr.ddply(**{'.data':mms, 
                        '.variables':ro.StrVector(['type','color']), 
                        '.fun':myfunc_r})
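The calls above pass `.data`, `.variables` and `.fun` through `**{...}` because a name starting with a dot is not a valid Python identifier, so it cannot be written as an ordinary keyword argument. A minimal pure-Python sketch of why the dictionary unpacking works follows. It assumes nothing about plyr itself: `fake_ddply` is a hypothetical stand-in that simply returns whatever keyword arguments it receives.

```python
def fake_ddply(**kwargs):
    """Hypothetical stand-in for plyr.ddply: echoes back its keyword arguments."""
    return kwargs


# '.data' cannot be typed as a keyword (fake_ddply(.data=...) is a SyntaxError)
# because it is not a valid Python identifier.
dotted_name_is_identifier = ".data".isidentifier()

# Unpacking a dict with ** sidesteps the restriction: the keys are plain strings.
args = {
    ".data": "mms",
    ".variables": ["type", "color"],
    ".fun": len,
}
received = fake_ddply(**args)

print(dotted_name_is_identifier)
print(sorted(received))
```

The same unpacking can be mixed with regular keyword arguments in one call, as the question does when it combines `n=...` with `**{'.data': ...}`.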

Upvotes: 2
