James Marriott

Reputation: 47

Collapse Column in R

I have searched as best I could, and part of my problem is that I'm really not sure exactly what to ask. Here's my data, and how I want it to end up:

Now:

john    a Yes
john    b No
john    c No
Rebekah a Yes
Rebekah d No
Chase   c Yes
Chase   d No
Chase   e No
Chase   f No

How I'd like it to be:

john     a,b,c    Yes
Rebekah  a,d      Yes
Chase    c,d,e,f  Yes

Notice that the 3rd column says Yes when the row is the first one with that particular value in the 1st column. The 3rd column isn't necessary; I was only using it because I thought I would do this all with if and for statements, but that seemed very inefficient. Is there an efficient way to do this?

Upvotes: 3

Views: 376

Answers (2)

Veerendra Gadekar

Reputation: 4472

Another option, using dplyr (with the data from @bgoldst's answer):

library('dplyr')

out = df %>%
      group_by(a) %>%
      # collapse the distinct b values per group; c is hard-coded to "yes"
      summarize(b = paste(unique(b), collapse=","), c = "yes")

#> out
#Source: local data frame [3 x 3]

#        a       b   c
#1   Chase c,d,e,f yes
#2 Rebekah     a,d yes
#3    john   a,b,c yes

Using data.table:

library(data.table)

out = setDT(df)[, .(b = paste(unique(b), collapse=','), c = "yes"), by = .(a)]

#> out
#         a       b   c
#1:    john   a,b,c yes
#2: Rebekah     a,d yes
#3:   Chase c,d,e,f yes
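
Note that both snippets above hard-code c = "yes". If the value should come from the data instead, one possibility is to take the first c in each group. This is only a minimal sketch, assuming dplyr is installed and rebuilding the data from @bgoldst's answer so that it runs on its own; the name out2 is made up for illustration.

```r
library(dplyr)

# Same data as in @bgoldst's answer, rebuilt here so the sketch is self-contained
df <- data.frame(
  a = c("john", "john", "john", "Rebekah", "Rebekah", "Chase", "Chase", "Chase", "Chase"),
  b = c("a", "b", "c", "a", "d", "c", "d", "e", "f"),
  c = c("Yes", "No", "No", "Yes", "No", "Yes", "No", "No", "No"),
  stringsAsFactors = FALSE
)

out2 <- df %>%
  group_by(a) %>%
  summarise(
    b = paste(unique(b), collapse = ","),  # distinct b values joined by commas
    c = first(c)                           # value of c on the first row of each group
  )

print(out2)
```

Since the first row of every group is marked Yes in the sample data, c comes out as "Yes" for all three groups here; with other data it would simply reflect whatever the first row holds.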

Upvotes: 6

bgoldst

Reputation: 35314

You can use by() to do this:

df <- data.frame(
  a=c('john','john','john','Rebekah','Rebekah','Chase','Chase','Chase','Chase'),
  b=c('a','b','c','a','d','c','d','e','f'),
  c=c('Yes','No','No','Yes','No','Yes','No','No','No'),
  stringsAsFactors=FALSE
)
# build one single-row data.frame per group, then stack them with rbind
do.call(rbind, by(df, df$a, function(x)
  data.frame(a=x$a[1], b=paste0(x$b, collapse=','), c=x$c[1], stringsAsFactors=FALSE)))
##               a       b   c
## Chase     Chase c,d,e,f Yes
## john       john   a,b,c Yes
## Rebekah Rebekah     a,d Yes

Edit: Here's another approach, running independent aggregations with tapply():

# indexing by key keeps the groups in their order of first appearance
key <- unique(df$a)
data.frame(a=key, b=tapply(df$b, df$a, paste, collapse=',')[key], c=tapply(df$c, df$a, `[`, 1)[key])
##               a       b   c
## john       john   a,b,c Yes
## Rebekah Rebekah     a,d Yes
## Chase     Chase c,d,e,f Yes

Edit: And yet another approach, merge()ing the result of a couple of aggregate() calls:

merge(aggregate(b~a, df, paste, collapse=','), aggregate(c~a, df, `[`, 1))
##         a       b   c
## 1   Chase c,d,e,f Yes
## 2    john   a,b,c Yes
## 3 Rebekah     a,d Yes
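
Since the question says the 3rd column isn't actually needed, a single aggregate() call may be enough. The sketch below is one way to do that in base R, with an extra step to put the groups back in their order of first appearance (aggregate() returns them sorted). The name res is only illustrative, and the data is rebuilt so the snippet runs on its own.

```r
# Same sample data as above, rebuilt so the sketch is self-contained
df <- data.frame(
  a = c("john", "john", "john", "Rebekah", "Rebekah", "Chase", "Chase", "Chase", "Chase"),
  b = c("a", "b", "c", "a", "d", "c", "d", "e", "f"),
  c = c("Yes", "No", "No", "Yes", "No", "Yes", "No", "No", "No"),
  stringsAsFactors = FALSE
)

# Collapse b within each value of a; the c column is simply ignored
res <- aggregate(b ~ a, data = df, FUN = paste, collapse = ",")

# Restore the order in which each name first appears in df
res <- res[match(unique(df$a), res$a), ]
rownames(res) <- NULL

print(res)
```

The match() step is optional; without it the rows come back sorted by a, as in the merge() output above.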

Upvotes: 4
