R. Sirkin
R. Sirkin

Reputation: 21

Concatenate varying number of columns by partial match (R)

first question on SO, though I've been lurking for awhile! I tried to do my due diligence and am getting closer to the answer.

I have a 300-column data frame that I want to merge down into about 10 columns, based on matching the pattern of the variable name. The raw data output gives me a column with the primary variable name (in the example, "before" and "after") and a number. In my "real" data, there are about 30 copies of each variable.

I want to combine every column that has "before" in its name, or "after," etc. I successfully created the variable "new" using data.table's syntax for this type of "calculated" column.

myTable2[, new := paste(before1, before2, sep = "")]

> myTable2
 herenow     before1 before2 before3  after1 after2 after3         new
1: 0.3399679      if     and   where     not   here  blank       ifand
2: 0.8181909     for      in      by through  blank  blank       forin
3: 0.2237681     and   where            mine  yours   ours    andwhere
4: 0.6161998     and   where              ha    hey    hon    andwhere
5: 0.7606252   fifth  eighth     and   where    not   beet fiftheighth
6: 0.5525105     and   where     not    fill           are    andwhere

But as you see, that is explicitly saying the columns I want to combine. I want to flexibly combine, such that if I have 31 copies of one variable and 86 of another I don't a) have to know that or b) have to type that. I just want to match based on the base variable name (e.g. "before") and combine the columns.

I tried to go to the next level using grep...

> newvar2 <- paste(grep("before", colnames(myTable2), value = TRUE), collapse = "")
> newvar2
[1] "before1before2before3"

and that confirmed to me that I can combine a variable number of values with grep pattern matching.

Next step: How do I combine these two steps such that the

new := paste(etc....)

takes the grep step as its argument and combines all those columns whose names match the pattern? This is what I want:

 herenow        before_Final    after_Final
1: 0.339967856  ifandwhere      nothereblank
2: 0.818190875  forinby         throughblankblank
3: 0.223768051  andwhere        mineyoursours
4: 0.616199835  andwhere        haheyhon
5: 0.760625218  fiftheighthand  wherenotbeet
6: 0.552510532  andwherenot     fillare

I am working on learning more about vectorization, but if I could even make a list of the variable types I want to combine (e.g. before, after, between) and then run through those, possibly in a loop, that would be great! So something like

finalVarNames <- c("Before_final", "After_final", "Between_final")
whatToMatch <- c("before", "after", "between")

(For loop here...)

myTable2[, finalVarNames[i] := paste(grep(whatToMatch[i], myTable2, value = TRUE), collapse = "")]

I know that syntax isn't right, likely in the second "myTable2" reference before the value argument. This code does create the new variable successfully, but it is blank. How do I get the concatenated group of grep-matching variables into it?

Thank you for any help you can give!

Upvotes: 1

Views: 464

Answers (1)

akuiper
akuiper

Reputation: 214957

You can use the Reduce function to paste selected columns together via specifying the columns by grep in the .SD syntax. Here is an example of getting the results using data.table package:

library(stringi); library(data.table)
myTable2[, paste(stri_trans_totitle(whatToMatch), "final", sep = "_") := 
           lapply(whatToMatch, function(wtm) Reduce(function(x,y) paste(x, y, sep = ""), 
                                             .SD[, grep(wtm, names(myTable2)), with = F]))]

myTable2
#      herenow before1 before2 before3  after1 after2 after3   Before_final       After_final
# 1: 0.3399679      if     and   where     not   here  blank     ifandwhere      nothereblank
# 2: 0.8181909     for      in      by through  blank  blank        forinby throughblankblank
# 3: 0.2237681     and   where            mine  yours   ours       andwhere     mineyoursours
# 4: 0.6161998     and   where              ha    hey    hon       andwhere          haheyhon
# 5: 0.7606252   fifth  eighth     and   where    not   beet fiftheighthand      wherenotbeet
# 6: 0.5525105     and   where     not    fill           are    andwherenot           filler

Some benchmark of do.call and Reduce:

dim(myTable2)
# [1] 1572864       9

reduce <- function() myTable2[, paste(stri_trans_totitle(whatToMatch[1:2]), "final", sep = "_") := lapply(whatToMatch[1:2], function(wtm) Reduce(function(x,y) paste(x, y, sep = ""), .SD[, grep(wtm, names(myTable2)), with = F]))]    
docall <- function() myTable2[, paste(stri_trans_totitle(whatToMatch[1:2]), "final", sep = "_") := lapply(whatToMatch[1:2], function(wtm) do.call(paste, c(sep = "", .SD[, grep(wtm, names(myTable2)), with = F])))]

microbenchmark::microbenchmark(docall(), reduce(), times = 10)
# Unit: milliseconds
#     expr      min        lq      mean    median        uq       max neval
# docall() 707.7818  722.6037  767.8923  737.6272  852.4909  868.8202    10
# reduce() 999.4925 1009.5146 1026.6200 1020.4637 1046.7073 1067.7479    10

Upvotes: 2

Related Questions