Reputation: 82
I'm trying to come up with multiple columns based on a column (path) in data table. My dataset is currently 6 millions rows.
EDIT: Included minimal dataset which is naive
set.seed(24);
DATA <- data.frame(path=paste0(sample(LETTERS[1:3], 25, replace=TRUE),">",sample(LETTERS[1:3], 25, replace=TRUE)), value=rnorm(25))
Following is code that I'm currently using (of course I'm not grepl'ing alphabets):
for (i in 1:nrow(DATA)) {
if(grepl("A", DATA$path[i])){DATA$A[i]=1}else{DATA$A[i]=0}
if(grepl("B", DATA$path[i])){DATA$B[i]=1}else{DATA$B[i]=0}
if(grepl("C", DATA$path[i])){DATA$C[i]=1}else{DATA$C[i]=0}
}
An older version of the code I was using is:
DATA$A <- sapply(DATA$path, function(x) { if(grepl("A", x)){1}else{0}})
DATA$B <- sapply(DATA$path, function(x) { if(grepl("B", x)){1}else{0}})
for every column that I want to output.
but this is very inefficient as it sapply too many times.
My question is: is there a more efficient way to do the same thing? or am I stuck with the first code chunk?
Thanks in advance!
Upvotes: 1
Views: 81
Reputation: 886938
We could split
the 'path' column with >
and get the unique
elements in that column ('Un1'). Loop over 'Un1' and use grepl
to find the match.
Un1 <- sort(unique(unlist(strsplit(as.character(DATA$path), '>'))))
DATA[Un1] <- lapply(Un1, function(x) as.integer(grepl(x, DATA$path)))
Or another option is using mtabulate
from qdapTools
on the strsplit
output, negate (!
) to convert 0
to 'TRUE' and all other values to 'FALSE', negate again so that 'FALSE' becomes 'TRUE' and viceversa, wrap it with +(
to coerce the logical to integer. We can also use as.integer
or +0L
etc for this.
library(qdapTools)
cbind(DATA, +(!!mtabulate(strsplit(as.character(DATA$path), '>')) ))
Or after splitting the 'path' column, we can loop over the columns, apply model.matrix
and use |
with Reduce
d1 <- do.call(rbind.data.frame,strsplit(as.character(DATA$path), '>'))
names(d1) <- paste0('path', 1:2)
cbind(DATA, +(Reduce(`|`,lapply(d1, function(x) model.matrix(~0+x)))))
Upvotes: 2