Reputation: 15458
I am trying to apply ddply to my sample data (call it z), which looks like this:
id y
1001 10
1001 11
1200 12
2001 10
2030 12
2100 32
3100 10
3190 13
4100 45
5100 67
5670 56
...
10001 54
10345 45
11234 32
and so on
My goal is to find the sum of y for ids grouped by their leading digits, i.e. by thousands: 1 (1001, 1200, ...), 2 (2001, 2030, 2100), 3 (3100, 3190), 4, ..., 10, 11, ..., 65. For example, for ids in group 1 the sum is 10+11+12=33, and for ids in group 2 it is 10+12+32=54.
I tried the following approach with split and lapply:
>s <- split(z,z$id)
>lapply(s, function(x) colSums(x[, c("y")]))
However, this gives me the sum for each unique id, not the grouped sum I am looking for. Any suggestions would be highly appreciated.
Upvotes: 1
Views: 661
Reputation: 115390
Here is a data.table solution that uses %/% to perform integer division (returning how many thousands each id contains):
library(data.table)
DT <- data.table(z)
x <- DT[,list(sum_y = sum(y)), by = list(id = id %/% 1000)]
x
   id sum_y
1:  1    33
2:  2    54
3:  3    23
4:  4    45
5:  5   123
6: 10    99
You could do something similar with ddply:
library(plyr)
ddply(z, .(id = id %/% 1000), summarize, sum_y = sum(y))
  id sum_y
1  1    33
2  2    54
3  3    23
4  4    45
5  5   123
6 10    99
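To see why this key works, here is a minimal sketch of what %/% returns for a few of the sample ids. The vector name ids is only for illustration and is not part of the code above.

```r
# A few ids taken from the sample data; `ids` is an illustrative name
ids <- c(1001, 1200, 2100, 5670, 10001, 10345)

# Integer division by 1000 drops the last three digits
grp <- ids %/% 1000
grp
# [1]  1  1  2  5 10 10
```

Note that 10001 and 10345 land in group 10 rather than group 1, which matches the 1, 2, ..., 10, 11, ... grouping described in the question.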
Upvotes: 5
Reputation: 93908
Does this give you the intended answer?
z <- read.table(textConnection("id y
1001 10
1001 11
1200 12
2001 10
2030 12
2100 32
3100 10
3190 13
4100 45
5100 67
5670 56
10001 54
10345 45"),header=TRUE)
result <- tapply(
z$y,
as.numeric(substr(z$id,1,nchar(z$id)-3)),
sum
)
result
  1   2   3   4   5  10
 33  54  23  45 123  99
To steal @mnel's line from above, this could be simplified to:
result <- tapply(
z$y,
z$id %/% 1000,
sum
)
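The two keys agree on this sample, but they would differ for ids below 1000, if the real data has any (an assumption; the sample has none). The substr version yields an empty string, which as.numeric turns into NA, while %/% yields 0. A small sketch with one made-up id:

```r
# 999 is a hypothetical id, added only to show the edge case
ids <- c(999, 1001, 10345)

key_substr <- as.numeric(substr(ids, 1, nchar(ids) - 3))
key_intdiv <- ids %/% 1000

key_substr  # NA for the id below 1000
key_intdiv  # 0 for the id below 1000
```

Since tapply drops observations whose group is NA, the integer-division key is probably the safer choice if short ids can occur.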
Upvotes: 3
Reputation: 109884
thelatemail provides a valid approach, but I want to point out that the problem isn't really with your understanding of lapply (your code was almost correct) but with how you think about grouping. thelatemail does this in his solution, and that's the key. I'm going to show your approach adjusted, then how I would actually approach this, and then a version using ave, just because I never get to use it :)
Read in data
# data borrowed from thelatemail's answer
z <- read.table(textConnection("id y
1001 10
1001 11
1200 12
2001 10
2030 12
2100 32
3100 10
3190 13
4100 45
5100 67
5670 56
10001 54
10345 45"),header=TRUE)
Your code adjusted
s <- split(z, substring(as.character(z$id), 1, nchar(as.character(z$id)) - 3))
lapply(s, function(x) sum(x[, "y"]))
The approach I would likely take: add a new grouping variable
z$IDgroup <- substring(as.character(z$id), 1, nchar(as.character(z$id)) - 3)
aggregate(y ~ IDgroup, z, sum)
#similar approach but adds the solution back as a new column
z$group.sum <- ave(z$y, z$IDgroup, FUN=sum)
z
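The same grouping idea can also be combined with the integer-division key from the other answers, staying close to the original split/lapply attempt. This is only a sketch; it rebuilds the sample data inline so that it runs on its own, and the name sums is illustrative.

```r
# Sample data from the question, rebuilt so the block is self-contained
z <- data.frame(
  id = c(1001, 1001, 1200, 2001, 2030, 2100, 3100, 3190,
         4100, 5100, 5670, 10001, 10345),
  y  = c(10, 11, 12, 10, 12, 32, 10, 13, 45, 67, 56, 54, 45)
)

# Split y by the thousands group, then sum each piece
sums <- sapply(split(z$y, z$id %/% 1000), sum)
sums
```

The result is a named vector with one sum per thousands group, so the group labels come along as names.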
Upvotes: 3