Reputation: 300
I have large data.table with 6 million rows.
Sample data is:
DT:
VAR1 VAR2 VAR3.....
6 5
NA 5 1
"?" 2 0
I like to redefine two columns. I used the code below and it works very slow.
DT=DT[,':='(VAR1=ifelse(DT$VAR1 == "?" | is.na(DT$VAR1) | is.null(DT$VAR1) |
length(DT$VAR1) == 0 , 1 , DT$VAR1),
VAR2 = ifelse ( DT$VAR2 == "?" | is.na(DT$VAR2) |
is.null(DT$VAR2) | length(DT$VAR2) == 0, "" , DT$VAR2)) ,with=FALSE]
Is there a way I can modify the code so it work faster?
Upvotes: 0
Views: 110
Reputation: 92302
I'm not sure what exactly you trying to do, but you have several fundamental mistakes here regarding how to operate with a data.table
object:
DT=DT
as :=
updates DT
by referenceDT
within the scope of a data.table
object by doing DT$VAR1
, just do VAR1
insteadifelse
is very inefficient and slow (despite being vectorized) so always try to avoid it (especially imbedded ifelse
). In your case, I'm suspecting that you are trying to replace everything that is not a number with 1
in VAR1
and with ""
in VAR2
, correct? If so, you could only look for the numeric values instead, and hence, avoid the ifelse
statementTry
DT[!grepl("\\d", VAR1), VAR1 := "1"]
DT[!grepl("\\d", VAR2), VAR2 := ""]
Which will return (for your example data set)
DT
# VAR1 VAR2 VAR3
# 1: 6 5
# 2: 1 5 1
# 3: 1 2 0
Upvotes: 2