user969113
user969113

Reputation: 2429

R help on aggregation function

for my question I created a dummy data frame:

    set.seed(007)
 DF <- data.frame(a = rep(LETTERS[1:5], each=2), b = sample(40:49), c = sample(1:10))
 DF
   a  b  c
1  A 49  2
2  A 43  3
3  B 40  7
4  B 47  1
5  C 41  9
6  C 48  8
7  D 45  6
8  D 42  5
9  E 46 10
10 E 44  4

How can I use the aggregation function on column a so that, for instance, for "A" the following value is calculated: 49-43 / 2+3?

I started like:

aggregate(DF, by=list(DF$a), FUN=function(x) {

  ...

})

The problem I have is that I do not know how to access the 4 different cells 49, 43, 2 and 3 I tried x[[1]][1] and similar stuff but don't get it working.

Upvotes: 1

Views: 379

Answers (5)

mnel
mnel

Reputation: 115382

A data.table solution - for efficiency of time and memory.

library(data.table)
DT <- as.data.table(DF)

DT[, list(calc = diff(b) / sum(c)), by = a]

Upvotes: 1

Jilber Urbina
Jilber Urbina

Reputation: 61154

Using data.table could be faster and easier.

library(data.table)
DT <- data.table(DF)
DT[, (-1*diff(b))/sum(c), by=a]
  a         V1
1: A  1.2000000
2: B -0.8750000
3: C -0.4117647
4: D  0.2727273
5: E  0.1428571

Using aggregate, not so good. I didn't a better way to do it using aggregate :( but here's an attempt.

B <- aggregate(DF$b, by=list(DF$a), diff)
C <- aggregate(DF$c, by=list(DF$a), sum)
data.frame(a=B[,1], Result=(-1*B[,2])/C[,2])
  a     Result
1 A  1.2000000
2 B -0.8750000
3 C -0.4117647
4 D  0.2727273
5 E  0.1428571

Upvotes: 2

John
John

Reputation: 23758

When you aggregate the FUN argument can be anything you want. Keep in mind that the value passed will either be a vector (if x is one column) or a little data.frame or matrix (if x is more than one). However, aggregate doesn't let you access the columns of a multi-column argument. For example.

aggregate( . ~ a, data = DF, FUN = function(x) diff(x[,1]) / sum(x[,2]) )

That fails with an error even though I used . (which takes all of the columns of DF that I'm not using elsewhere). To see what aggregate is trying to do there look at the following.

aggregate( . ~ a, data = DF, FUN = sum )

The two columns, b, and c, were aggregated but from the first attempt we know that you can't do something that accesses each column separately. So, strictly sticking with aggregate you need two passes and three lines of code.

diffb <- aggregate( b ~ a, data = DF, FUN = diff )
Y <- aggregate( c ~ a, data = DF, FUN = sum )
Y$c <- diffb$b / Y$c

Now Y contains the result you want.

The by function is simpler than aggregate and all it does is split the original data.frame using the indices and then apply the FUN function.

l <- by( data = DF, INDICES = DF$a, FUN = function(x) diff(x$b)/sum(x$c), simplify = FALSE )
unlist(l)

You have to do a little to get the result back into a data.frame if you really want one.

data.frame(a = names(l), x = unlist(l))

Upvotes: 2

digEmAll
digEmAll

Reputation: 57210

You can use the base by() function:

listOfRows <- 
by(data=DF,
   INDICES=DF$a,
   FUN=function(x){data.frame(a=x$a[1],res=(x$b[1] - x$b[2])/(x$c[1] + x$c[2]))})

newDF <- do.call(rbind,listOfRows)

Upvotes: 0

flodel
flodel

Reputation: 89057

Inside aggregate, the function FUN is applied independently to each column of your data. Here you want to use a function that takes two columns as inputs, so a priori, you can't use aggregate for that.

Instead, you can use ddply from the plyr package:

ddply(DF, "a", summarize, res = (b[1] - b[2]) / sum(c))
#   a        res
# 1 A  1.2000000
# 2 B -0.8750000
# 3 C -0.4117647
# 4 D  0.2727273
# 5 E  0.1428571

Upvotes: 4

Related Questions