Alon Hazan
Alon Hazan

Reputation: 300

Efficiently define new columns on data.table with conditions

I have large data.table with 6 million rows.

Sample data is:

DT:

VAR1 VAR2 VAR3.....
6           5
NA    5     1
"?"   2     0

I like to redefine two columns. I used the code below and it works very slow.

DT=DT[,':='(VAR1=ifelse(DT$VAR1 == "?" | is.na(DT$VAR1) |  is.null(DT$VAR1) | 
                  length(DT$VAR1) == 0 ,  1 , DT$VAR1),
            VAR2 = ifelse ( DT$VAR2 == "?" | is.na(DT$VAR2) |  
                  is.null(DT$VAR2) | length(DT$VAR2) == 0, "" , DT$VAR2)) ,with=FALSE]

Is there a way I can modify the code so it work faster?

Upvotes: 0

Views: 110

Answers (1)

David Arenburg
David Arenburg

Reputation: 92302

I'm not sure what exactly you trying to do, but you have several fundamental mistakes here regarding how to operate with a data.table object:

  1. No need in doing DT=DT as := updates DT by reference
  2. No need in referencing to DT within the scope of a data.table object by doing DT$VAR1, just do VAR1 instead
  3. ifelse is very inefficient and slow (despite being vectorized) so always try to avoid it (especially imbedded ifelse). In your case, I'm suspecting that you are trying to replace everything that is not a number with 1 in VAR1 and with "" in VAR2, correct? If so, you could only look for the numeric values instead, and hence, avoid the ifelse statement

Try

DT[!grepl("\\d", VAR1), VAR1 := "1"]
DT[!grepl("\\d", VAR2), VAR2 := ""]

Which will return (for your example data set)

DT
#    VAR1 VAR2 VAR3
# 1:    6         5
# 2:    1    5    1
# 3:    1    2    0

Upvotes: 2

Related Questions