theoden

Reputation: 408

How do I convert a "by" data structure into a data.frame with factors?

I want to take a data.frame some of whose columns are factors, summarize it in complex ways by factor groupings, and then assemble the result in a new summary data.frame. This has got to be something people do all the time, but I can't seem to get it right. Here's a simplified example of the kind of thing I want to do:

> df
  direction distance
1     south 83.40364
2      east 38.45644
3      west 92.29418
4      east 87.81878
5     north 99.62949
6      west 10.65441
7     south 58.06977
8     north 79.34895
> bydir <- by(df,df$direction,function(x) {
    list(dir=x$direction[1], dist=sum(x$distance))})
> dirs <- data.frame()
> for (i in bydir) {dirs <- rbind(dirs, i)}
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = "north") :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = "south") :
  invalid factor level, NA generated
3: In `[<-.factor`(`*tmp*`, ri, value = "west") :
  invalid factor level, NA generated
> dirs
    dir     dist
2  east 126.2752
21 <NA> 178.9784
3  <NA> 141.4734
4  <NA> 102.9486

I've looked at plyr a bit and I bet I could get it to work for me, but my real question is this: since I'm building the data.frame a little at a time, why can't R accept new values of the dir factor that aren't valid levels and simply add them as levels? Even converting the factor to character and setting stringsAsFactors = FALSE in rbind does not prevent R from trying to make that column a factor and producing NAs. I'd like a solution, but more than that, I'd like to understand what R is doing here.

Thanks,

Glenn

P.S. I found some interesting directions here: http://lamages.blogspot.com/2012/01/say-it-in-r-with-by-apply-and-friends.html but I haven't gotten any of them to work for my case yet.

Upvotes: 0

Views: 72

Answers (1)

MrFlick

Reputation: 206232

The problem is your loop. You can't easily rbind to an empty data.frame with no columns. Luckily, that's completely avoidable.

bydir <- by(df,df$direction,function(x) {
    list(dir=x$direction[1], dist=sum(x$distance))})
do.call(rbind.data.frame, bydir)

is better. It would be even better to return a data.frame rather than a generic list:

bydir <- by(df,df$direction,function(x) {
    data.frame(dir=x$direction[1], dist=sum(x$distance))})
do.call(rbind, bydir)
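For anyone who wants to run this end to end, here is a minimal sketch using the sample data from the question. It assumes direction is a factor; the explicit factor() call is my addition, since stringsAsFactors defaults to FALSE from R 4.0.0 onward and the column would otherwise be character.

```r
# Sample data copied from the question. direction is made a factor explicitly
# (an assumption about the original df, which evidently held a factor column).
df <- data.frame(
  direction = factor(c("south", "east", "west", "east",
                       "north", "west", "south", "north")),
  distance  = c(83.40364, 38.45644, 92.29418, 87.81878,
                99.62949, 10.65441, 58.06977, 79.34895)
)

# Build one single-row data.frame per direction.
bydir <- by(df, df$direction, function(x) {
  data.frame(dir = x$direction[1], dist = sum(x$distance))
})

# Bind all per-group rows in a single call instead of growing
# an empty data.frame inside a loop.
dirs <- do.call(rbind, bydir)

str(dirs)
```

Because every piece already carries the full set of levels, rbind has no unknown level to deal with, so dir should stay a factor with no NAs.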

Of course, by() is overkill for this particular example. A simple aggregate would do:

aggregate(distance~direction, df, sum)

but I assume your real scenario is more complex.
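As for what R is doing: the warnings in the question come from `[<-.factor`, the assignment method for factors. A factor has a fixed set of levels, and assigning a value outside that set gives NA with a warning rather than silently adding a level. The sketch below shows that behaviour on a toy factor, plus two common ways around it. The variable names are only illustrative, and I'm not claiming this traces every step rbind took in your loop.

```r
f <- factor("east")   # a factor whose only level is "east"

# Assigning a value that is not an existing level produces NA.
# Without suppressWarnings() R would warn:
#   invalid factor level, NA generated
g <- f
suppressWarnings(g[2] <- "north")
is.na(g[2])

# Workaround 1: extend the levels before assigning the new value.
h <- f
levels(h) <- c(levels(h), "north")
h[2] <- "north"
h

# Workaround 2: accumulate plain character values and convert to a
# factor once, after all values are known.
chr <- c(as.character(f), "north")
k <- factor(chr)
levels(k)
```

Either workaround does the job, but building all the pieces first and binding them in one call, as above, avoids the issue altogether.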

Upvotes: 2
