Reputation: 23
I have a spreadsheet that have >100 columns, and many columns have the same names. I would like to merge those columns with the same names and row-sum the values in those columns. I think conditional execution, if(), should do it, but I'm stuck at writing the condition for identical column names? And what will be the function to merge and sum the columns? merge()? or rowsum()?
aa <- read.table()
if (colnames(aa) == ) merge/rowsum()
Thanks.
This is a sample of what it looks like now:
B C U B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
And this is what I hope to get: Reduction of the number of columns and sum the values when merging.
B C U
2 2 1
4 4 2
6 6 3
8 8 4
Upvotes: 0
Views: 1352
Reputation: 887741
Here is another option with melt/dcast
from data.table
. We convert the 'data.frame' to 'data.table' (setDT(df1)
), create a row number column ('rn'), melt
from 'wide' to 'long' format and then dcast
it to 'wide' by specifying the fun.aggregate
as sum
.
library(data.table)
setDT(df1)[, rn := 1:.N]
dcast(melt(df1, id.var= "rn"), rn ~variable, value.var="value", sum)[, rn:= NULL][]
# B C U
#1: 2 2 1
#2: 4 4 2
#3: 6 6 3
#4: 8 8 4
Upvotes: 1
Reputation: 35324
Using split()
, lapply()
, rowSums()
, and do.call()
/cbind()
:
do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) rowSums(df[x])));
## B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4
Replacing the rowSums()
call with Reduce()
/`+`()
:
do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) Reduce(`+`,df[x])));
## B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4
Replacing the index vector middleman with splitting the data.frame (as an unclassed list) directly:
do.call(cbind,lapply(split(as.list(df),names(df)),function(x) Reduce(`+`,x)));
## B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4
library(microbenchmark);
bgoldst1 <- function(df) do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) rowSums(df[x])));
bgoldst2 <- function(df) do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) Reduce(`+`,df[x])));
bgoldst3 <- function(df) do.call(cbind,lapply(split(as.list(df),names(df)),function(x) Reduce(`+`,x)));
sotos <- function(df) sapply(unique(names(df)), function(i)rowSums(df[names(df) == i]));
df <- data.frame(B=c(1L,2L,3L,4L),C=c(1L,2L,3L,4L),U=c(1L,2L,3L,4L),B=c(1L,2L,3L,4L),C=c(1L,2L,3L,4L),check.names=F);
ex <- bgoldst1(df);
all.equal(ex,sotos(df)[,colnames(ex)]);
## [1] TRUE
all.equal(ex,bgoldst2(df));
## [1] TRUE
all.equal(ex,bgoldst3(df));
## [1] TRUE
microbenchmark(bgoldst1(df),bgoldst2(df),bgoldst3(df),sotos(df));
## Unit: microseconds
## expr min lq mean median uq max neval
## bgoldst1(df) 245.473 258.3030 278.9499 272.4155 286.742 641.052 100
## bgoldst2(df) 156.949 166.3580 184.2206 171.7030 181.539 1042.618 100
## bgoldst3(df) 82.110 92.5875 100.9138 97.2915 107.128 170.207 100
## sotos(df) 200.997 211.9030 226.7977 223.6630 235.210 328.010 100
set.seed(1L);
NR <- 1e3L; NC <- 1e3L;
df <- setNames(nm=LETTERS[sample(seq_along(LETTERS),NC,T)],data.frame(replicate(NC,sample(seq_len(NR*3L),NR,T))));
ex <- bgoldst1(df);
all.equal(ex,sotos(df)[,colnames(ex)]);
## [1] TRUE
all.equal(ex,bgoldst2(df));
## [1] TRUE
all.equal(ex,bgoldst3(df));
## [1] TRUE
microbenchmark(bgoldst1(df),bgoldst2(df),bgoldst3(df),sotos(df));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst1(df) 11.070218 11.586182 12.745706 12.870209 13.234997 16.15929 100
## bgoldst2(df) 4.534402 4.680446 6.161428 6.097900 6.425697 44.83254 100
## bgoldst3(df) 3.430203 3.555505 5.355128 4.919931 5.219930 41.79279 100
## sotos(df) 19.953848 21.419628 22.713282 21.829533 22.280279 60.86525 100
Upvotes: 3
Reputation: 51582
One way to do it,
sapply(unique(names(df)), function(i)rowSums(df[names(df) == i]))
# B C U
#[1,] 2 2 1
#[2,] 4 4 2
#[3,] 6 6 3
#[4,] 8 8 4
Upvotes: 2