Reputation: 47
PROBLEM STATEMENT: Generating dummy variables based on values in multiple columns.
To assign values (more like dummy variables) to columns based on their presence in other “multiple columns". The following code uses data frames.
Explanation:
Code is given to explain the output desired.
set.seed(12345)
df<- data.frame(A1=c(1L,2L),A2=LETTERS[1:3],A3=round(rnorm(4),4),A4=1:12)
df
names= paste0("V",c(1:12))
df[,c(names)]=0
for ( i in 1:nrow(df)){ df[i,c(names)]=match(c(1:12),df[i,c("A1","A4")])}
df[,c(names)][!is.na(df[,c(names)])]=1
df[,c(names)][is.na(df[,c(names)])]=0
df
I would like to have suggestions for code using data table : = operator so that process can be faster. Thanks
Upvotes: 3
Views: 715
Reputation: 887521
We can use lapply
to loop the columns 'A1' and 'A4' of df
, compare with the values 1:12 with sapply
, Use Reduce
with |
and collapse the list
output to a single matrix. The +
is for converting logical matrix to binary format. In the last step we cbind
with the original dataset
cbind(df, +(Reduce('|', lapply(df[c(1,4)], function(x) sapply(1:12, '==', x)))))
Another base R
option without looping will be table
. We unlist
the columns of interest i.e. 'A1', 'A4', get the table
with 1:12
values, double negate (!!
) to make '0' values FALSE and all other TRUE, use +
to coerce the logical matrix to binary 1/0
, and cbind
with the original dataset.
subDF <- df[c('A1', 'A4')]
newdf <- cbind(df, +(!!table(rep(1:12, ncol(subDF)), unlist(subDF))))
colnames(newdf)[5:ncol(newdf)] <- paste0('V', 1:12)
newdf
# A1 A2 A3 A4 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
#1 1 A 0.5855 1 1 0 0 0 0 0 0 0 0 0 0 0
#2 2 B 0.7095 2 0 1 0 0 0 0 0 0 0 0 0 0
#3 1 C -0.1093 3 1 0 1 0 0 0 0 0 0 0 0 0
#4 2 A -0.4535 4 0 1 0 1 0 0 0 0 0 0 0 0
#5 1 B 0.5855 5 1 0 0 0 1 0 0 0 0 0 0 0
#6 2 C 0.7095 6 0 1 0 0 0 1 0 0 0 0 0 0
#7 1 A -0.1093 7 1 0 0 0 0 0 1 0 0 0 0 0
#8 2 B -0.4535 8 0 1 0 0 0 0 0 1 0 0 0 0
#9 1 C 0.5855 9 1 0 0 0 0 0 0 0 1 0 0 0
#10 2 A 0.7095 10 0 1 0 0 0 0 0 0 0 1 0 0
#11 1 B -0.1093 11 1 0 0 0 0 0 0 0 0 0 1 0
#12 2 C -0.4535 12 0 1 0 0 0 0 0 0 0 0 0 1
We can also use data.table
. I am not sure whether this is very efficient as we do table
inside the data.table
. The approach would be to first convert the 'data.frame' to 'data.table' (setDT(df)
), unlist
the columns specified in the .SDcols
, get the seq_len
of number of rows (.N
) i.e. 1:12 in the example, replicate (rep
) it by the length
of 'nm1', and get the table
.
We create a data.table
from the table
class (split(tbl..
), by looping through the columns using a for
loop, we set
the values to binary 0/1
. The set
approach is efficient as it avoids the overhead of [.data.table
. Later, we can cbind
with the original dataset.
library(data.table)
nm1 <- c('A1', 'A4')
tbl <- setDT(df)[, table(rep(seq_len(.N),length(nm1)), unlist(.SD)), .SDcols=nm1]
dt1 <- setDT(split(tbl, col(tbl)))[]
for(j in seq_along(dt1)) {
set(dt1, i=NULL, j=j, value=+(!!dt1[[j]]))
}
cbind(df, dt1)
Upvotes: 3