Reputation: 2429
for my question I created a dummy data frame:
set.seed(007)
DF <- data.frame(a = rep(LETTERS[1:5], each=2), b = sample(40:49), c = sample(1:10))
DF
a b c
1 A 49 2
2 A 43 3
3 B 40 7
4 B 47 1
5 C 41 9
6 C 48 8
7 D 45 6
8 D 42 5
9 E 46 10
10 E 44 4
How can I use the aggregation function on column a
so that, for instance, for "A" the following value is calculated: 49-43 / 2+3?
I started like:
aggregate(DF, by=list(DF$a), FUN=function(x) {
...
})
The problem I have is that I do not know how to access the 4 different cells 49, 43, 2 and 3
I tried x[[1]][1]
and similar stuff but don't get it working.
Upvotes: 1
Views: 379
Reputation: 115382
A data.table
solution - for efficiency of time and memory.
library(data.table)
DT <- as.data.table(DF)
DT[, list(calc = diff(b) / sum(c)), by = a]
Upvotes: 1
Reputation: 61154
Using data.table
could be faster and easier.
library(data.table)
DT <- data.table(DF)
DT[, (-1*diff(b))/sum(c), by=a]
a V1
1: A 1.2000000
2: B -0.8750000
3: C -0.4117647
4: D 0.2727273
5: E 0.1428571
Using aggregate, not so good. I didn't a better way to do it using aggregate
:( but here's an attempt.
B <- aggregate(DF$b, by=list(DF$a), diff)
C <- aggregate(DF$c, by=list(DF$a), sum)
data.frame(a=B[,1], Result=(-1*B[,2])/C[,2])
a Result
1 A 1.2000000
2 B -0.8750000
3 C -0.4117647
4 D 0.2727273
5 E 0.1428571
Upvotes: 2
Reputation: 23758
When you aggregate the FUN
argument can be anything you want. Keep in mind that the value passed will either be a vector
(if x is one column) or a little data.frame
or matrix
(if x is more than one). However, aggregate
doesn't let you access the columns of a multi-column argument. For example.
aggregate( . ~ a, data = DF, FUN = function(x) diff(x[,1]) / sum(x[,2]) )
That fails with an error even though I used .
(which takes all of the columns of DF that I'm not using elsewhere). To see what aggregate
is trying to do there look at the following.
aggregate( . ~ a, data = DF, FUN = sum )
The two columns, b, and c, were aggregated but from the first attempt we know that you can't do something that accesses each column separately. So, strictly sticking with aggregate you need two passes and three lines of code.
diffb <- aggregate( b ~ a, data = DF, FUN = diff )
Y <- aggregate( c ~ a, data = DF, FUN = sum )
Y$c <- diffb$b / Y$c
Now Y contains the result you want.
The by
function is simpler than aggregate
and all it does is split the original data.frame
using the indices and then apply the FUN
function.
l <- by( data = DF, INDICES = DF$a, FUN = function(x) diff(x$b)/sum(x$c), simplify = FALSE )
unlist(l)
You have to do a little to get the result back into a data.frame
if you really want one.
data.frame(a = names(l), x = unlist(l))
Upvotes: 2
Reputation: 57210
You can use the base by()
function:
listOfRows <-
by(data=DF,
INDICES=DF$a,
FUN=function(x){data.frame(a=x$a[1],res=(x$b[1] - x$b[2])/(x$c[1] + x$c[2]))})
newDF <- do.call(rbind,listOfRows)
Upvotes: 0
Reputation: 89057
Inside aggregate
, the function FUN
is applied independently to each column of your data. Here you want to use a function that takes two columns as inputs, so a priori, you can't use aggregate
for that.
Instead, you can use ddply
from the plyr
package:
ddply(DF, "a", summarize, res = (b[1] - b[2]) / sum(c))
# a res
# 1 A 1.2000000
# 2 B -0.8750000
# 3 C -0.4117647
# 4 D 0.2727273
# 5 E 0.1428571
Upvotes: 4