Reputation: 3082
I have a datatable object of around 0.8 million rows. Some columns in them have Yes
/No
values. I want to change it to 1
/0
respectively. Here workable_data
is my data.table object:
My code for changing (which seems too simplistic):
yes_No_to_1 <- function(x){
sapply(x,function(x){
if(x =='Yes'){
1
}else{
0}
})
}
sapply(workable_data[,product,with=FALSE],yes_No_to_1)
However there are two problem s: (1) it's taking a long time and (2) I want to change the values permanently in workable_data
in a data.table way.
You can reproduce sample data with:
product <- c("HasProduct1","HasProduct2","HasProduct3","HasProduct4","HasProduct5","HasProduct6","HasProduct7","HasProduct8","HasProduct9","HasProduct10")
workable_data <- as.data.table(data.frame(x=sample(1:100),sapply(product,function(x){x <-sample(c("Yes","No"),10000,replace = T)})))
I assume there is a better way to achieve this but I cant seem to figure it out and I would also appreciate any advise on optimization of my function (that is going to be the last resort). The time taken in this sample data is, which is pretty bad for 10k rows.
user system elapsed
6.21 0.00 6.25
Upvotes: 2
Views: 1769
Reputation: 83215
You could do this with the set
functionality in data.table:
1: Create a vector of columnnames in which you want to change the Yes
to 1
and the No
to 0
(like @Frank said in the comments)
cols <- grep("^HasProduct", names(DT), value = TRUE)
2: Change the values with the following for(...) set(...)
implementation (as rightfully pointed out by @Arun in the comments, you can also use as.integer
instead of just +
):
for (col in cols) set(DT, j = col, value = +(DT[[col]] == "Yes"))
this results in:
> DT x HasProduct1 HasProduct2 HasProduct3 HasProduct4 HasProduct5 HasProduct6 HasProduct7 HasProduct8 HasProduct9 HasProduct10 1: 23 0 1 0 1 0 0 1 0 0 0 2: 74 1 0 1 1 0 1 1 1 1 1 3: 35 1 1 0 0 0 1 1 1 0 1 4: 7 1 1 1 1 0 1 1 0 0 1 5: 92 0 1 1 1 1 1 0 1 1 0 --- 9996: 56 0 0 1 0 1 0 0 0 1 0 9997: 59 1 0 1 1 0 1 1 1 1 0 9998: 85 0 1 0 1 1 1 1 1 1 1 9999: 93 1 0 0 0 0 0 0 0 1 1 10000: 29 0 1 1 0 0 1 0 1 1 1
Timings:
user system elapsed
0.007 0.000 0.007
Used data:
set.seed(654)
product <- c("HasProduct1","HasProduct2","HasProduct3","HasProduct4","HasProduct5","HasProduct6","HasProduct7","HasProduct8","HasProduct9","HasProduct10")
DT <- as.data.table(data.frame(x=sample(1:100),sapply(product,function(x){x <-sample(c("Yes","No"),10000,replace = T)})))
Upvotes: 4
Reputation: 139
In R, factors are internally treated as 1,2,3... integers corresponding to each level within a factor. "No" comes in front of "Yes" in alphabetic order, so "No" == 1L, "Yes" == 2L ("L" is just setting a number as an integer). We can retain integers from a factor and subtract 1L to arrive to "No" == 0L, "Yes" == 1L.
The advantage of this approach as compared to one of Jaap above is that factors with arbitrary level numbers can be transformed in the same way. The computational times are largely equivalent, especially for larger tables.
require(data.table)
# generate example data
set.seed(2020)
product <- c("HasProduct1","HasProduct2","HasProduct3","HasProduct4","HasProduct5","HasProduct6","HasProduct7","HasProduct8","HasProduct9","HasProduct10")
DT <- as.data.table(data.frame(x=sample(1:100),sapply(product,function(x){x <-sample(c("Yes","No"),10000,replace = T)})))
# define the columns to process
cols <- grep("^HasProduct", names(DT), value=TRUE)
# this is another data.table way to do it
DT[, (cols) := lapply(.SD, function(x) return(as.integer(as.factor(x)) - 1L)), .SDcols = cols]
Upvotes: 0
Reputation: 188
Something like this should work, though I haven't tested it:
data[, zero_one := ifelse(test = (data[, yes_no] == "yes"), yes = 1, no = 0)]
zero_one
is a new variable assigned using :=
and the computation on the right is applied to each value in the data table.
Upvotes: 0