Soon Hwee N
Soon Hwee N

Reputation: 23

R: merging columns and the values if they have the same column name

I have a spreadsheet that have >100 columns, and many columns have the same names. I would like to merge those columns with the same names and row-sum the values in those columns. I think conditional execution, if(), should do it, but I'm stuck at writing the condition for identical column names? And what will be the function to merge and sum the columns? merge()? or rowsum()?

aa <- read.table()

if (colnames(aa) == ) merge/rowsum()

Thanks.

This is a sample of what it looks like now:

B C U B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4

And this is what I hope to get: Reduction of the number of columns and sum the values when merging.

B C U
2 2 1
4 4 2
6 6 3
8 8 4

Upvotes: 0

Views: 1352

Answers (3)

akrun
akrun

Reputation: 887741

Here is another option with melt/dcast from data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)), create a row number column ('rn'), melt from 'wide' to 'long' format and then dcast it to 'wide' by specifying the fun.aggregate as sum.

library(data.table)
setDT(df1)[, rn :=  1:.N]
dcast(melt(df1, id.var= "rn"), rn ~variable, value.var="value", sum)[, rn:= NULL][]
#   B C U
#1: 2 2 1
#2: 4 4 2
#3: 6 6 3
#4: 8 8 4

Upvotes: 1

bgoldst
bgoldst

Reputation: 35324

Solution 1

Using split(), lapply(), rowSums(), and do.call()/cbind():

do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) rowSums(df[x])));
##      B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4

Solution 2

Replacing the rowSums() call with Reduce()/`+`():

do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) Reduce(`+`,df[x])));
##      B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4

Solution 3

Replacing the index vector middleman with splitting the data.frame (as an unclassed list) directly:

do.call(cbind,lapply(split(as.list(df),names(df)),function(x) Reduce(`+`,x)));
##      B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4

Benchmarking

library(microbenchmark);

bgoldst1 <- function(df) do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) rowSums(df[x])));
bgoldst2 <- function(df) do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) Reduce(`+`,df[x])));
bgoldst3 <- function(df) do.call(cbind,lapply(split(as.list(df),names(df)),function(x) Reduce(`+`,x)));
sotos <- function(df) sapply(unique(names(df)), function(i)rowSums(df[names(df) == i]));

df <- data.frame(B=c(1L,2L,3L,4L),C=c(1L,2L,3L,4L),U=c(1L,2L,3L,4L),B=c(1L,2L,3L,4L),C=c(1L,2L,3L,4L),check.names=F);

ex <- bgoldst1(df);
all.equal(ex,sotos(df)[,colnames(ex)]);
## [1] TRUE
all.equal(ex,bgoldst2(df));
## [1] TRUE
all.equal(ex,bgoldst3(df));
## [1] TRUE

microbenchmark(bgoldst1(df),bgoldst2(df),bgoldst3(df),sotos(df));
## Unit: microseconds
##          expr     min       lq     mean   median      uq      max neval
##  bgoldst1(df) 245.473 258.3030 278.9499 272.4155 286.742  641.052   100
##  bgoldst2(df) 156.949 166.3580 184.2206 171.7030 181.539 1042.618   100
##  bgoldst3(df)  82.110  92.5875 100.9138  97.2915 107.128  170.207   100
##     sotos(df) 200.997 211.9030 226.7977 223.6630 235.210  328.010   100

set.seed(1L);
NR <- 1e3L; NC <- 1e3L;
df <- setNames(nm=LETTERS[sample(seq_along(LETTERS),NC,T)],data.frame(replicate(NC,sample(seq_len(NR*3L),NR,T))));

ex <- bgoldst1(df);
all.equal(ex,sotos(df)[,colnames(ex)]);
## [1] TRUE
all.equal(ex,bgoldst2(df));
## [1] TRUE
all.equal(ex,bgoldst3(df));
## [1] TRUE

microbenchmark(bgoldst1(df),bgoldst2(df),bgoldst3(df),sotos(df));
## Unit: milliseconds
##          expr       min        lq      mean    median        uq      max neval
##  bgoldst1(df) 11.070218 11.586182 12.745706 12.870209 13.234997 16.15929   100
##  bgoldst2(df)  4.534402  4.680446  6.161428  6.097900  6.425697 44.83254   100
##  bgoldst3(df)  3.430203  3.555505  5.355128  4.919931  5.219930 41.79279   100
##     sotos(df) 19.953848 21.419628 22.713282 21.829533 22.280279 60.86525   100

Upvotes: 3

Sotos
Sotos

Reputation: 51582

One way to do it,

sapply(unique(names(df)), function(i)rowSums(df[names(df) == i]))

#     B C U
#[1,] 2 2 1
#[2,] 4 4 2
#[3,] 6 6 3
#[4,] 8 8 4

Upvotes: 2

Related Questions