how to delete duplicated duplicated of each column after group_by() using `dplyr` package

Question

I have a data.frame mydata like this

   V1 V2 V3 V4 V5
1  a  b  a      
2  a  b     c   
3  a  b        d
4  x  y  h      
5  x  y     k  e

I want to group it by the columns V1and V2, and delete the "" string in the other columns

the result should like this

  V1 V2 V3 V4 V5
1  a  b  a  c  d
2  x  y  h  k  e

is their a efficient way to do this using the dplyr package? Thank you very much.

akrun · Accepted Answer

We can use dplyr/tidyr. We reshape the data from 'wide' to 'long' using gather, remove the blank elements in the 'Val' column with filter, and reshape it back to 'wide' format with spread.

library(dplyr)
library(tidyr) 
gather(mydata, Var, Val, V3:V5) %>% 
              filter(Val!='') %>% 
              spread(Var, Val)
#   V1 V2 V3 V4 V5
#1  a  b  a  c  d
#2  x  y  h  k  e

Or another approach using only dplyr (if the number of non-blank values are the same across each groups) would be to group by 'V1', 'V2', and use summarise_each to select only the elements that are not blank (.[.!=''])

 mydata %>%
       group_by(V1, V2) %>% 
       summarise_each(funs(.[.!='']))
 #  V1 V2 V3 V4 V5
 #1  a  b  a  c  d
 #2  x  y  h  k  e

We can also use data.table to do this. We convert the 'data.frame' to 'data.table' (setDT(mydata)), grouped by 'V1', 'V2', we loop through the other columns (lapply(.SD, ...)) and subset the elements that are not blank.

 library(data.table)
 setDT(mydata)[,lapply(.SD, function(x) x[x!='']) ,.(V1, V2)]
 #   V1 V2 V3 V4 V5
 #1:  a  b  a  c  d
 #2:  x  y  h  k  e

Similar approach using aggregate from base R is

 aggregate(.~V1+V2, mydata, FUN=function(x) x[x!=''])
 #  V1 V2 V3 V4 V5
 #1  a  b  a  c  d
 #2  x  y  h  k  e

data

mydata <- structure(list(V1 = c("a", "a", "a", "x", "x"),
V2 = c("b", "b", 
"b", "y", "y"), V3 = c("a", "", "", "h", ""), V4 = c("", "c", 
"", "", "k"), V5 = c("", "", "d", "", "e")), .Names = c("V1", 
"V2", "V3", "V4", "V5"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5"))

how to delete duplicated duplicated of each column after group_by() using `dplyr` package

Answers (2)

data

Related Questions