Reputation: 408
I want to take a data.frame some of whose columns are factors, summarize it in complex ways by factor groupings, and then assemble the result in a new summary data.frame.
This has got to be something people do all the time, but I can't seem to get it right. Here's a simplified example of the kind of thing I want to do:
> df
  direction distance
1     south 83.40364
2      east 38.45644
3      west 92.29418
4      east 87.81878
5     north 99.62949
6      west 10.65441
7     south 58.06977
8     north 79.34895
> bydir <- by(df,df$direction,function(x) {
list(dir=x$direction[1], dist=sum(x$distance))})
> dirs <- data.frame()
> for (i in bydir) {dirs <- rbind(dirs, i)}
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = "north") :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = "south") :
invalid factor level, NA generated
3: In `[<-.factor`(`*tmp*`, ri, value = "west") :
invalid factor level, NA generated
> dirs
    dir     dist
2  east 126.2752
21 <NA> 178.9784
3  <NA> 141.4734
4  <NA> 102.9486
I've looked at plyr a bit and I bet I could get it to work for me, but my real question is: why can't R accept new values of the dir factor that aren't valid levels and simply add levels to the factor, since I'm building the data.frame a little at a time? Even converting the factor to character and setting stringsAsFactors = FALSE in rbind does not prevent R from trying to make that column a factor and producing NAs. I'd like a solution, but more than that, I'd like to understand what R is doing here.
Thanks,
Glenn
P.S. I found some interesting directions here: http://lamages.blogspot.com/2012/01/say-it-in-r-with-by-apply-and-friends.html but I haven't gotten any of them to work for my case yet.
Upvotes: 0
Views: 72
Reputation: 206232
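The problem is your loop. You can't easily rbind to an empty data.frame with no columns. The warnings show that after the first pass the dir column is a factor that lacks the levels north, south and west, so those values are replaced with NA. Luckily, that's completely avoidable.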
bydir <- by(df,df$direction,function(x) {
list(dir=x$direction[1], dist=sum(x$distance))})
do.call(rbind.data.frame, bydir)
is better. It would be even better to return a data.frame rather than a generic list:
bydir <- by(df,df$direction,function(x) {
data.frame(dir=x$direction[1], dist=sum(x$distance))})
do.call(rbind, bydir)
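For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds df from the values printed in the question (building direction as a factor is my assumption about how the original data was stored) and keeps the result in dirs, the name used in the question.

```r
# Sample data copied from the question; direction is assumed to be a factor
df <- data.frame(
  direction = factor(c("south", "east", "west", "east",
                       "north", "west", "south", "north")),
  distance  = c(83.40364, 38.45644, 92.29418, 87.81878,
                99.62949, 10.65441, 58.06977, 79.34895)
)

# Build one single-row data.frame per direction
bydir <- by(df, df$direction, function(x) {
  data.frame(dir = x$direction[1], dist = sum(x$distance))
})

# Bind all the pieces in one call instead of growing a data.frame in a loop
dirs <- do.call(rbind, bydir)
dirs
```

Because every piece is already a data.frame whose dir column carries the full set of levels, rbind has nothing to coerce, so no NA values should appear.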
Of course, by() is overkill for this particular example. A simple aggregate would do:
aggregate(distance~direction, df, sum)
but I assume your real scenario is more complex.
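If the real summary needs several values per group, the same pattern should still apply: have the function return a one-row data.frame with as many columns as you need. The sketch below is only an illustration; summarize_group and the columns n, total and mean are hypothetical names I made up, not anything from the question.

```r
# Sample data copied from the question; direction is assumed to be a factor
df <- data.frame(
  direction = factor(c("south", "east", "west", "east",
                       "north", "west", "south", "north")),
  distance  = c(83.40364, 38.45644, 92.29418, 87.81878,
                99.62949, 10.65441, 58.06977, 79.34895)
)

# Hypothetical per-group summary: returns exactly one row per group
summarize_group <- function(x) {
  data.frame(
    dir   = x$direction[1],   # the group label
    n     = nrow(x),          # number of rows in the group
    total = sum(x$distance),  # total distance
    mean  = mean(x$distance)  # average distance
  )
}

# Apply the summary to each group and bind the rows in one call
dirs <- do.call(rbind, by(df, df$direction, summarize_group))
dirs
```

Keeping the per-group logic in a named function makes it easy to test on a single subset, e.g. summarize_group(df[df$direction == "east", ]), before applying it to every group.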
Upvotes: 2