BigFinger
BigFinger

Reputation: 1043

Computing on multiple column names in a data.table

I have a sparse data table that looks like this:

 data = data.table(
    var1 = c("a","",""),
    var2 = c("","","c"),
    var3 = c("a","b",""),
    var4 = c("","b","")
)
      var1 var2 var3 var4
    1:    a         a     
    2:              b    b
    3:         c     

I would like to add a column that contains a string of zeros and ones indicating which variables are present in any row, like this:

  var1 var2 var3 var4  concat
1:    a         a      1|0|1|0
2:              b    b 0|0|1|1
3:         c           0|1|0|0

I can get to this with the following command:

data[, concat := paste(
           as.integer(var1 != ""),
           as.integer(var2 != ""),
           as.integer(var3 != ""),
           as.integer(var4 != ""),
           sep = "|")]

However, if I have hundreds of variables, I would rather use some sort of computation to get to the desired expression. Perhaps something based on paste0("var",1:4), or at least a vector of column names. Any suggestions?

Upvotes: 5

Views: 186

Answers (5)

BigFinger
BigFinger

Reputation: 1043

Thanks for the many and so diverse solutions. I am impressed!

I did some benchmarking on my large dataset to compare running time of several of the different approaches. Here is what I found:

data[ , concat := apply(.SD, 1, function(x) paste(+(x == ""), collapse = "|"))]

Time: 6 min, 41 s

data[, concat := do.call(paste, c(lapply(.SD, function(x) (x!="")+0 ), sep="|")) ]

Time: 10 min, 26 s

data[,concat := paste0(as.integer(.SD != ""), collapse = "|"), by = 1:nrow(data)]

Time: > 40 min (manually killed)

Upvotes: 2

thelatemail
thelatemail

Reputation: 93813

Variation not requiring any grouping by each row or apply-ing on each row.

data[, concat := do.call(paste, c(lapply(.SD, function(x) (x!="")+0 ), sep="|")) ]

#   var1 var2 var3 var4  concat
#1:    a         a      1|0|1|0
#2:              b    b 0|0|1|1
#3:         c           0|1|0|0

Upvotes: 4

akuiper
akuiper

Reputation: 214957

Another option is to group the data by row and paste each row together:

data[,concat := paste0(as.integer(.SD != ""), collapse = "|"), by = 1:nrow(data)]
data
#   var1 var2 var3 var4  concat
#1:    a         a      1|0|1|0
#2:              b    b 0|0|1|1
#3:         c           0|1|0|0

Upvotes: 1

MichaelChirico
MichaelChirico

Reputation: 34703

Same basic approach as above:

data[ , concat := apply(.SD, 1, function(x) paste(+(x == ""), collapse = "|"))][]
#    var1 var2 var3 var4  concat
# 1:    a         a      0|1|0|1
# 2:              b    b 1|1|0|0
# 3:         c           1|0|1|1

Upvotes: 4

TomNash
TomNash

Reputation: 3288

 data$concat <- apply(apply(data, 2, function(x) ifelse(x == "", 0, 1)), 1, function(x) paste(x, collapse="|"))

Breakdown:

1) For each column in data, check if element is empty, if so return 0, else 1

apply(data, 2, function(x) ifelse(x == "", 0, 1))

Let's call the return from (1) the variable concat. For each row of concat, paste everything together with a pipe (|) separating them. Set the new column of data to equal this.

apply(concat, 1, function(x) paste(x, collapse="|"))

Upvotes: 2

Related Questions