Reputation: 21
first question on SO, though I've been lurking for awhile! I tried to do my due diligence and am getting closer to the answer.
I have a 300-column data frame that I want to merge down into about 10 columns, based on matching the pattern of the variable name. The raw data output gives me a column with the primary variable name (in the example, "before" and "after") and a number. In my "real" data, there are about 30 copies of each variable.
I want to combine every column that has "before" in its name, or "after," etc. I successfully created the variable "new" using data.table's syntax for this type of "calculated" column.
myTable2[, new := paste(before1, before2, sep = "")]
> myTable2
herenow before1 before2 before3 after1 after2 after3 new
1: 0.3399679 if and where not here blank ifand
2: 0.8181909 for in by through blank blank forin
3: 0.2237681 and where mine yours ours andwhere
4: 0.6161998 and where ha hey hon andwhere
5: 0.7606252 fifth eighth and where not beet fiftheighth
6: 0.5525105 and where not fill are andwhere
But as you see, that is explicitly saying the columns I want to combine. I want to flexibly combine, such that if I have 31 copies of one variable and 86 of another I don't a) have to know that or b) have to type that. I just want to match based on the base variable name (e.g. "before") and combine the columns.
I tried to go to the next level using grep...
> newvar2 <- paste(grep("before", colnames(myTable2), value = TRUE), collapse = "")
> newvar2
[1] "before1before2before3"
and that confirmed to me that I can combine a variable number of values with grep pattern matching.
Next step: How do I combine these two steps such that the
new := paste(etc....)
takes the grep step as its argument and combines all those columns whose names match the pattern? This is what I want:
herenow before_Final after_Final
1: 0.339967856 ifandwhere nothereblank
2: 0.818190875 forinby throughblankblank
3: 0.223768051 andwhere mineyoursours
4: 0.616199835 andwhere haheyhon
5: 0.760625218 fiftheighthand wherenotbeet
6: 0.552510532 andwherenot fillare
I am working on learning more about vectorization, but if I could even make a list of the variable types I want to combine (e.g. before, after, between) and then run through those, possibly in a loop, that would be great! So something like
finalVarNames <- c("Before_final", "After_final", "Between_final")
whatToMatch <- c("before", "after", "between")
(For loop here...)
myTable2[, finalVarNames[i] := paste(grep(whatToMatch[i], myTable2, value = TRUE), collapse = "")]
I know that syntax isn't right, likely in the second "myTable2" reference before the value argument. This code does create the new variable successfully, but it is blank. How do I get the concatenated group of grep-matching variables into it?
Thank you for any help you can give!
Upvotes: 1
Views: 464
Reputation: 214957
You can use the Reduce
function to paste selected columns together via specifying the columns by grep
in the .SD
syntax. Here is an example of getting the results using data.table
package:
library(stringi); library(data.table)
myTable2[, paste(stri_trans_totitle(whatToMatch), "final", sep = "_") :=
lapply(whatToMatch, function(wtm) Reduce(function(x,y) paste(x, y, sep = ""),
.SD[, grep(wtm, names(myTable2)), with = F]))]
myTable2
# herenow before1 before2 before3 after1 after2 after3 Before_final After_final
# 1: 0.3399679 if and where not here blank ifandwhere nothereblank
# 2: 0.8181909 for in by through blank blank forinby throughblankblank
# 3: 0.2237681 and where mine yours ours andwhere mineyoursours
# 4: 0.6161998 and where ha hey hon andwhere haheyhon
# 5: 0.7606252 fifth eighth and where not beet fiftheighthand wherenotbeet
# 6: 0.5525105 and where not fill are andwherenot filler
Some benchmark of do.call
and Reduce
:
dim(myTable2)
# [1] 1572864 9
reduce <- function() myTable2[, paste(stri_trans_totitle(whatToMatch[1:2]), "final", sep = "_") := lapply(whatToMatch[1:2], function(wtm) Reduce(function(x,y) paste(x, y, sep = ""), .SD[, grep(wtm, names(myTable2)), with = F]))]
docall <- function() myTable2[, paste(stri_trans_totitle(whatToMatch[1:2]), "final", sep = "_") := lapply(whatToMatch[1:2], function(wtm) do.call(paste, c(sep = "", .SD[, grep(wtm, names(myTable2)), with = F])))]
microbenchmark::microbenchmark(docall(), reduce(), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# docall() 707.7818 722.6037 767.8923 737.6272 852.4909 868.8202 10
# reduce() 999.4925 1009.5146 1026.6200 1020.4637 1046.7073 1067.7479 10
Upvotes: 2