Reputation: 455
I have a data.frame mydata
like this
V1 V2 V3 V4 V5
1 a b a
2 a b c
3 a b d
4 x y h
5 x y k e
I want to group it by the columns V1
and V2
, and delete the ""
string in the other columns
the result should like this
V1 V2 V3 V4 V5
1 a b a c d
2 x y h k e
is their a efficient way to do this using the dplyr
package? Thank you very much.
Upvotes: 1
Views: 103
Reputation: 6213
Using base R, if that's of interest
x <- data.frame(V1 = c(rep("a", 3), "x", "x"),
V2 = c(rep("b", 3), "y", "y"),
V3= c("a", "", "", "h", ""),
V4 = c("", "c", "", "", "k"),
V5 = c(rep("", 2), "d", "", "e"))
temp <- lapply(x[], function(y) as.character(unique(y[y != ""])))
data.frame(do.call(cbind,temp))
V1 V2 V3 V4 V5
1 a b a c d
2 x y h k e
Upvotes: 3
Reputation: 887851
We can use dplyr/tidyr
. We reshape the data from 'wide' to 'long' using gather
, remove the blank elements in the 'Val' column with filter
, and reshape it back to 'wide' format with spread
.
library(dplyr)
library(tidyr)
gather(mydata, Var, Val, V3:V5) %>%
filter(Val!='') %>%
spread(Var, Val)
# V1 V2 V3 V4 V5
#1 a b a c d
#2 x y h k e
Or another approach using only dplyr
(if the number of non-blank values are the same across each groups) would be to group by 'V1', 'V2', and use summarise_each
to select only the elements that are not blank (.[.!='']
)
mydata %>%
group_by(V1, V2) %>%
summarise_each(funs(.[.!='']))
# V1 V2 V3 V4 V5
#1 a b a c d
#2 x y h k e
We can also use data.table
to do this. We convert the 'data.frame' to 'data.table' (setDT(mydata)
), grouped by 'V1', 'V2', we loop through the other columns (lapply(.SD, ...)
) and subset the elements that are not blank.
library(data.table)
setDT(mydata)[,lapply(.SD, function(x) x[x!='']) ,.(V1, V2)]
# V1 V2 V3 V4 V5
#1: a b a c d
#2: x y h k e
Similar approach using aggregate
from base R
is
aggregate(.~V1+V2, mydata, FUN=function(x) x[x!=''])
# V1 V2 V3 V4 V5
#1 a b a c d
#2 x y h k e
mydata <- structure(list(V1 = c("a", "a", "a", "x", "x"),
V2 = c("b", "b",
"b", "y", "y"), V3 = c("a", "", "", "h", ""), V4 = c("", "c",
"", "", "k"), V5 = c("", "", "d", "", "e")), .Names = c("V1",
"V2", "V3", "V4", "V5"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
Upvotes: 2