Reputation: 71
I have a data frame where some consecutive columns have the same name. I need to search for these, add their values in for each row, drop one column and replace the other with their sum. without previously knowing which patterns are duplicated, possibly having to compare one column name with the following to see if there's a match.
Can someone help?
Thanks in advance.
Upvotes: 5
Views: 5487
Reputation: 263362
> dfrm <- data.frame(a = 1:10, b= 1:10, cc= 1:10, dd=1:10, ee=1:10)
> names(dfrm) <- c("a", "a", "b", "b", "b")
> sapply(unique(names(dfrm)[duplicated(names(dfrm))]),
function(x) Reduce("+", dfrm[ , grep(x, names(dfrm))]) )
a b
[1,] 2 3
[2,] 4 6
[3,] 6 9
[4,] 8 12
[5,] 10 15
[6,] 12 18
[7,] 14 21
[8,] 16 24
[9,] 18 27
[10,] 20 30
EDIT 2: Using rowSums allows simplification of the first sapply argumentto just unique(names(dfrm))
at the expense of needing to remember to include drop=FALSE in "[":
sapply(unique(names(dfrm)),
function(x) rowSums( dfrm[ , grep(x, names(dfrm)), drop=FALSE]) )
To deal with NA's:
sapply(unique(names(dfrm)),
function(x) apply(dfrm[grep(x, names(dfrm))], 1,
function(y) if ( all(is.na(y)) ) {NA} else { sum(y, na.rm=TRUE) }
) )
(Edit note: addressed Tommy counter-example by putting unique around the names(.)[.] construction. The erroneous code was:
sapply(names(dfrm)[unique(duplicated(names(dfrm)))],
function(x) Reduce("+", dfrm[ , grep(x, names(dfrm))]) )
Upvotes: 7
Reputation: 55695
Here is my one liner
# transpose data frame, sum by group = rowname, transpose back.
t(rowsum(t(dfrm), group = rownames(t(dfrm))))
Upvotes: 4
Reputation: 121077
Some sample data.
dfr <- data.frame(
foo = rnorm(20),
bar = 1:20,
bar = runif(20),
check.names = FALSE
)
Method: Loop over unique column names; if there is only one of that name, then selecting all columns with that nme will return a vector, but if there are duplicates it will also be a data frame. Use rowSums
to sum over rows. (Duh. EDIT: Not quite as 'duh' as previously thought!) EDIT: lapply
returns a list, which we need to reform into a data frame, and finally we fix the names.sapply
avoids the need for the last step.
unique_col_names <- unique(colnames(dfr))
new_dfr <- sapply(unique_col_names, function(name)
{
subs <- dfr[, colnames(dfr) == name]
if(is.data.frame(subs))
rowSums(subs)
else
subs
})
Upvotes: 2
Reputation: 10437
One way is to identify duplcates using (surprise) the duplicated
function, and then loop through them to calculate the sums. Here is an example:
dat.dup <- data.frame(x=1:10, x=1:10, x=1:10, y=1:10, y=1:10, z=1:10, check.names=FALSE)
dups <- unique(names(dat.dup)[duplicated(names(dat.dup))])
for (i in dups) {
dat.dup[[i]] <- rowSums(dat.dup[names(dat.dup) == i])
}
dat <- dat.dup[!duplicated(names(dat.dup))]
Upvotes: 2