Zihu Guo
Zihu Guo

Reputation: 455

how to delete duplicated duplicated of each column after group_by() using `dplyr` package

I have a data.frame mydata like this

   V1 V2 V3 V4 V5
1  a  b  a      
2  a  b     c   
3  a  b        d
4  x  y  h      
5  x  y     k  e

I want to group it by the columns V1and V2, and delete the "" string in the other columns

the result should like this

  V1 V2 V3 V4 V5
1  a  b  a  c  d
2  x  y  h  k  e

is their a efficient way to do this using the dplyr package? Thank you very much.

Upvotes: 1

Views: 103

Answers (2)

Whitebeard
Whitebeard

Reputation: 6213

Using base R, if that's of interest

x <- data.frame(V1 = c(rep("a", 3), "x", "x"), 
    V2 = c(rep("b", 3), "y", "y"), 
    V3= c("a", "", "", "h", ""), 
    V4 = c("", "c", "", "", "k"), 
    V5 = c(rep("", 2), "d", "", "e"))

temp <- lapply(x[], function(y) as.character(unique(y[y != ""])))
data.frame(do.call(cbind,temp))

  V1 V2 V3 V4 V5
1  a  b  a  c  d
2  x  y  h  k  e

Upvotes: 3

akrun
akrun

Reputation: 887851

We can use dplyr/tidyr. We reshape the data from 'wide' to 'long' using gather, remove the blank elements in the 'Val' column with filter, and reshape it back to 'wide' format with spread.

library(dplyr)
library(tidyr) 
gather(mydata, Var, Val, V3:V5) %>% 
              filter(Val!='') %>% 
              spread(Var, Val)
#   V1 V2 V3 V4 V5
#1  a  b  a  c  d
#2  x  y  h  k  e

Or another approach using only dplyr (if the number of non-blank values are the same across each groups) would be to group by 'V1', 'V2', and use summarise_each to select only the elements that are not blank (.[.!=''])

 mydata %>%
       group_by(V1, V2) %>% 
       summarise_each(funs(.[.!='']))
 #  V1 V2 V3 V4 V5
 #1  a  b  a  c  d
 #2  x  y  h  k  e

We can also use data.table to do this. We convert the 'data.frame' to 'data.table' (setDT(mydata)), grouped by 'V1', 'V2', we loop through the other columns (lapply(.SD, ...)) and subset the elements that are not blank.

 library(data.table)
 setDT(mydata)[,lapply(.SD, function(x) x[x!='']) ,.(V1, V2)]
 #   V1 V2 V3 V4 V5
 #1:  a  b  a  c  d
 #2:  x  y  h  k  e

Similar approach using aggregate from base R is

 aggregate(.~V1+V2, mydata, FUN=function(x) x[x!=''])
 #  V1 V2 V3 V4 V5
 #1  a  b  a  c  d
 #2  x  y  h  k  e

data

mydata <- structure(list(V1 = c("a", "a", "a", "x", "x"),
V2 = c("b", "b", 
"b", "y", "y"), V3 = c("a", "", "", "h", ""), V4 = c("", "c", 
"", "", "k"), V5 = c("", "", "d", "", "e")), .Names = c("V1", 
"V2", "V3", "V4", "V5"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5"))

Upvotes: 2

Related Questions