DaleSteyn
DaleSteyn

Reputation: 47

Row-wise assignment of values to multiple columns

PROBLEM STATEMENT: Generating dummy variables based on values in multiple columns.

To assign values (more like dummy variables) to columns based on their presence in other “multiple columns". The following code uses data frames.

Explanation:

Code is given to explain the output desired.

set.seed(12345)
df<- data.frame(A1=c(1L,2L),A2=LETTERS[1:3],A3=round(rnorm(4),4),A4=1:12)
df
names= paste0("V",c(1:12))
df[,c(names)]=0
for ( i in 1:nrow(df)){ df[i,c(names)]=match(c(1:12),df[i,c("A1","A4")])}
df[,c(names)][!is.na(df[,c(names)])]=1
df[,c(names)][is.na(df[,c(names)])]=0
df

I would like to have suggestions for code using data table : = operator so that process can be faster. Thanks

Upvotes: 3

Views: 715

Answers (1)

akrun
akrun

Reputation: 887521

We can use lapply to loop the columns 'A1' and 'A4' of df, compare with the values 1:12 with sapply, Use Reduce with | and collapse the list output to a single matrix. The + is for converting logical matrix to binary format. In the last step we cbind with the original dataset

cbind(df, +(Reduce('|', lapply(df[c(1,4)], function(x) sapply(1:12, '==', x)))))

Another base R option without looping will be table. We unlist the columns of interest i.e. 'A1', 'A4', get the table with 1:12 values, double negate (!!) to make '0' values FALSE and all other TRUE, use + to coerce the logical matrix to binary 1/0, and cbind with the original dataset.

subDF <- df[c('A1', 'A4')]
newdf <- cbind(df, +(!!table(rep(1:12, ncol(subDF)), unlist(subDF))))
colnames(newdf)[5:ncol(newdf)] <- paste0('V', 1:12)
newdf
#    A1 A2      A3 A4 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
#1   1  A  0.5855  1  1  0  0  0  0  0  0  0  0   0   0   0
#2   2  B  0.7095  2  0  1  0  0  0  0  0  0  0   0   0   0
#3   1  C -0.1093  3  1  0  1  0  0  0  0  0  0   0   0   0
#4   2  A -0.4535  4  0  1  0  1  0  0  0  0  0   0   0   0
#5   1  B  0.5855  5  1  0  0  0  1  0  0  0  0   0   0   0
#6   2  C  0.7095  6  0  1  0  0  0  1  0  0  0   0   0   0
#7   1  A -0.1093  7  1  0  0  0  0  0  1  0  0   0   0   0
#8   2  B -0.4535  8  0  1  0  0  0  0  0  1  0   0   0   0
#9   1  C  0.5855  9  1  0  0  0  0  0  0  0  1   0   0   0
#10  2  A  0.7095 10  0  1  0  0  0  0  0  0  0   1   0   0
#11  1  B -0.1093 11  1  0  0  0  0  0  0  0  0   0   1   0
#12  2  C -0.4535 12  0  1  0  0  0  0  0  0  0   0   0   1

We can also use data.table. I am not sure whether this is very efficient as we do table inside the data.table. The approach would be to first convert the 'data.frame' to 'data.table' (setDT(df)), unlist the columns specified in the .SDcols, get the seq_len of number of rows (.N) i.e. 1:12 in the example, replicate (rep) it by the length of 'nm1', and get the table.

We create a data.table from the table class (split(tbl..), by looping through the columns using a for loop, we set the values to binary 0/1. The set approach is efficient as it avoids the overhead of [.data.table. Later, we can cbind with the original dataset.

library(data.table)
nm1 <- c('A1', 'A4')
tbl <- setDT(df)[, table(rep(seq_len(.N),length(nm1)), unlist(.SD)), .SDcols=nm1]

dt1 <- setDT(split(tbl, col(tbl)))[]
for(j in seq_along(dt1)) {
       set(dt1, i=NULL, j=j, value=+(!!dt1[[j]]))
}

cbind(df, dt1)

Upvotes: 3

Related Questions