Different results when using ddply and summarize. Due to different R and plyr versions?

Question

I'm looking to summarize data similar to the ToothGrowth data in the datasets package.

The output I want looks like this:

  supp   len  half   one   two
1   OJ 619.9 132.3 227.0 260.6
2   VC 508.9  79.8 167.7 261.4

That is, the sum of lengths split by dose and supplement type. My colleague gets this output with R version 2.15.1 and plyr_1.7.1, using the following code:

library(datasets)
library(plyr)  # provides ddply() and summarize()

x <- ToothGrowth

test <- ddply(x,c("supp"),summarize,
                     len = sum(len,na.rm=TRUE),
                     half = sum(len[dose==0.5],na.rm=TRUE),
                     one = sum(len[dose==1],na.rm=TRUE),
                     two = sum(len[dose==2],na.rm=TRUE))

There are no NAs in the ToothGrowth data but there are in the real dataset.

I get the following output with R version 3.0.0 and plyr_1.8. I can provide the full sessionInfo() for both if that would be useful.

  supp   len  half one two
1   OJ 619.9 619.9   0   0
2   VC 508.9 508.9   0   0

This doesn't raise an error. In my data I have only three 'doses' but many 'supplement types'. Where there are no values in the half category, the whole sum ends up in one or two instead.

Is there a way to do this that produces consistent output across versions?

Thanks for your help.

joran · Accepted Answer

summarise was updated to "mutate by default", so to speak: each new column is visible to the expressions that come after it. So in the last three variables, when you refer to len, you are actually referring to the len variable you just created, which is only a single value. Call it something else:

test <- ddply(x,c("supp"),summarize,
                     len1 = sum(len,na.rm=TRUE),
                     half = sum(len[dose==0.5],na.rm=TRUE),
                     one = sum(len[dose==1],na.rm=TRUE),
                     two = sum(len[dose==2],na.rm=TRUE))
test  # print the result
  supp  len1  half   one   two
1   OJ 619.9 132.3 227.0 260.6
2   VC 508.9  79.8 167.7 261.4
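
To see why half picks up the whole total while one and two come out as 0, here is a rough base-R sketch of sequential evaluation. summarise_sequential is a made-up helper for illustration only, not plyr's actual code; the one thing it assumes is that each expression can see the columns created before it.

```r
# Hypothetical helper: evaluates each named expression in turn, letting
# later expressions see (and be shadowed by) earlier results.
summarise_sequential <- function(data, ...) {
  exprs <- as.list(substitute(list(...)))[-1]
  cols <- as.list(data)
  for (nm in names(exprs)) {
    cols[[nm]] <- eval(exprs[[nm]], cols, parent.frame())
  }
  as.data.frame(cols[names(exprs)])
}

# One group of the split, as ddply would hand it to the summary function.
oj <- subset(datasets::ToothGrowth, supp == "OJ")

# Reusing the name len: the later expressions index the one-value total.
bad <- summarise_sequential(oj,
  len  = sum(len, na.rm = TRUE),
  half = sum(len[dose == 0.5], na.rm = TRUE),
  one  = sum(len[dose == 1], na.rm = TRUE))

# A fresh name leaves the original len column untouched.
good <- summarise_sequential(oj,
  len1 = sum(len, na.rm = TRUE),
  half = sum(len[dose == 0.5], na.rm = TRUE),
  one  = sum(len[dose == 1], na.rm = TRUE))
```

In bad, indexing the length-one len with the 30-element dose mask returns the total at position 1 and NA everywhere else; na.rm=TRUE then silently drops the NAs, which is why no error shows up.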

(I originally mistakenly called this a change in ddply.) As for why, I suppose because it seemed like it would be convenient, and people requested the change. Here is a link to the issue raised and the subsequent patch.
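
If the goal is output that does not depend on the plyr version at all, one option is to skip summarize and build the table in base R. This is only a sketch of an alternative, not what the code above does: it assumes the three dose values are known in advance (0.5, 1 and 2, as in ToothGrowth) and that an empty supp/dose combination should count as 0.

```r
x <- datasets::ToothGrowth

# Sum len in every supp x dose cell; na.rm = TRUE is passed on to sum().
by_dose <- tapply(x$len, list(supp = x$supp, dose = x$dose), sum, na.rm = TRUE)

# Combinations with no rows come back as NA; treat them as 0.
by_dose[is.na(by_dose)] <- 0

test <- data.frame(
  supp = rownames(by_dose),
  len  = rowSums(by_dose),   # total per supplement type
  half = by_dose[, "0.5"],
  one  = by_dose[, "1"],
  two  = by_dose[, "2"],
  row.names = NULL
)
test
```

Because len here is computed from the per-dose sums rather than inside the same summarize call, the column can keep its original name without shadowing anything.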
