Bg1850
Bg1850

Reputation: 3082

Need to change data.table columns' value from "Yes" ,"No" to 1,0

I have a datatable object of around 0.8 million rows. Some columns in them have Yes/No values. I want to change it to 1/0 respectively. Here workable_data is my data.table object:

My code for changing (which seems too simplistic):

yes_No_to_1 <- function(x){
  sapply(x,function(x){
    if(x =='Yes'){
      1
      }else{
        0}
    })
} 

sapply(workable_data[,product,with=FALSE],yes_No_to_1)

However there are two problem s: (1) it's taking a long time and (2) I want to change the values permanently in workable_data in a data.table way.

You can reproduce sample data with:

product <- c("HasProduct1","HasProduct2","HasProduct3","HasProduct4","HasProduct5","HasProduct6","HasProduct7","HasProduct8","HasProduct9","HasProduct10")
workable_data <- as.data.table(data.frame(x=sample(1:100),sapply(product,function(x){x <-sample(c("Yes","No"),10000,replace = T)})))

I assume there is a better way to achieve this but I cant seem to figure it out and I would also appreciate any advise on optimization of my function (that is going to be the last resort). The time taken in this sample data is, which is pretty bad for 10k rows.

  user    system elapsed 
   6.21    0.00    6.25 

Upvotes: 2

Views: 1769

Answers (3)

Jaap
Jaap

Reputation: 83215

You could do this with the set functionality in data.table:

1: Create a vector of columnnames in which you want to change the Yes to 1 and the No to 0 (like @Frank said in the comments)

cols <- grep("^HasProduct", names(DT), value = TRUE)

2: Change the values with the following for(...) set(...) implementation (as rightfully pointed out by @Arun in the comments, you can also use as.integer instead of just +):

for (col in cols) set(DT, j = col, value = +(DT[[col]] == "Yes"))

this results in:

> DT
        x HasProduct1 HasProduct2 HasProduct3 HasProduct4 HasProduct5 HasProduct6 HasProduct7 HasProduct8 HasProduct9 HasProduct10
    1: 23           0           1           0           1           0           0           1           0           0            0
    2: 74           1           0           1           1           0           1           1           1           1            1
    3: 35           1           1           0           0           0           1           1           1           0            1
    4:  7           1           1           1           1           0           1           1           0           0            1
    5: 92           0           1           1           1           1           1           0           1           1            0
   ---                                                                                                                            
 9996: 56           0           0           1           0           1           0           0           0           1            0
 9997: 59           1           0           1           1           0           1           1           1           1            0
 9998: 85           0           1           0           1           1           1           1           1           1            1
 9999: 93           1           0           0           0           0           0           0           0           1            1
10000: 29           0           1           1           0           0           1           0           1           1            1

Timings:

   user  system elapsed 
  0.007   0.000   0.007 

Used data:

set.seed(654)
product <- c("HasProduct1","HasProduct2","HasProduct3","HasProduct4","HasProduct5","HasProduct6","HasProduct7","HasProduct8","HasProduct9","HasProduct10")
DT <- as.data.table(data.frame(x=sample(1:100),sapply(product,function(x){x <-sample(c("Yes","No"),10000,replace = T)})))

Upvotes: 4

Max Moldovan
Max Moldovan

Reputation: 139

In R, factors are internally treated as 1,2,3... integers corresponding to each level within a factor. "No" comes in front of "Yes" in alphabetic order, so "No" == 1L, "Yes" == 2L ("L" is just setting a number as an integer). We can retain integers from a factor and subtract 1L to arrive to "No" == 0L, "Yes" == 1L.

The advantage of this approach as compared to one of Jaap above is that factors with arbitrary level numbers can be transformed in the same way. The computational times are largely equivalent, especially for larger tables.

require(data.table)

# generate example data
set.seed(2020)
product <- c("HasProduct1","HasProduct2","HasProduct3","HasProduct4","HasProduct5","HasProduct6","HasProduct7","HasProduct8","HasProduct9","HasProduct10")
DT <- as.data.table(data.frame(x=sample(1:100),sapply(product,function(x){x <-sample(c("Yes","No"),10000,replace = T)})))

# define the columns to process
cols <- grep("^HasProduct", names(DT), value=TRUE)

# this is another data.table way to do it
DT[, (cols) := lapply(.SD, function(x) return(as.integer(as.factor(x)) - 1L)), .SDcols = cols]

Upvotes: 0

dc3
dc3

Reputation: 188

Something like this should work, though I haven't tested it:

data[, zero_one := ifelse(test = (data[, yes_no] == "yes"), yes = 1, no = 0)]

zero_one is a new variable assigned using := and the computation on the right is applied to each value in the data table.

Upvotes: 0

Related Questions