Reputation: 277
Consider a sample dataset:
library(data.table)
library(stringi)
library(stringr)
dt <- data.table(data.frame(V1 = c("C1/R3","M2/R4")))
> dt
V1
1: C1/R3
2: M2/R4
For each row of dt, I want to extract the characters C, M, or R and concatenate them. For example,
dt[, V2 := stri_join_list(str_match_all(V1, "[CMR]"), sep = "", collapse = ""), by = seq_len(nrow(dt))]
> dt
V1 V2
1: C1/R3 CR
2: M2/R4 MR
However, I have 42 million rows and the above code is not nearly efficient enough. Is there a way to do this without row-wise operations? When I omit the by argument, I get CRMR for every row.
Upvotes: 0
Views: 138
Reputation: 1369
If, as you stated, you only wish to capture the letters C, M and R into a new column in your data.table, then the following should work efficiently by assigning in place:
dt[, V2 := gsub('[^CMR]', '', V1, perl=TRUE, useBytes=TRUE)]
The pattern [^CMR] matches any character that is not C, M or R, and each match is replaced with the empty string ''.
Per the help from ?gsub: "If you can make use of useBytes = TRUE, the strings will not be checked before matching, and the actual matching will be faster."
Finally, from what I have read, I believe using perl=TRUE should be faster than omitting it. However, perhaps you could test both ways on your real data and reply with the results to confirm for us?
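One way to run that comparison is sketched below in base R. The synthetic vector x, its size n, and the helper names strip_perl and strip_default are assumptions made for illustration and do not come from the question. Timings depend on the machine and R version, so none are shown here.

```r
# Synthetic strings shaped like the question's V1 column (an assumption).
set.seed(1)
n <- 1e5
letters_cmr <- c("C", "M", "R")
x <- paste0(
  sample(letters_cmr, n, replace = TRUE), sample(1:9, n, replace = TRUE), "/",
  sample(letters_cmr, n, replace = TRUE), sample(1:9, n, replace = TRUE)
)

# Hypothetical helpers: the same substitution with and without perl = TRUE.
strip_perl    <- function(v) gsub("[^CMR]", "", v, perl = TRUE, useBytes = TRUE)
strip_default <- function(v) gsub("[^CMR]", "", v, useBytes = TRUE)

# Check that both variants agree before comparing their speed.
stopifnot(identical(strip_perl(x), strip_default(x)))

# Time each variant; repeat a few times for more stable numbers.
print(system.time(strip_perl(x)))
print(system.time(strip_default(x)))
```

On the real table the same two calls can be timed directly on dt$V1, or on a subset of a few million rows first.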
Upvotes: 0
Reputation: 520898
One option uses sub:
dt <- data.table(data.frame(V1 = c("C1/R3","M2/R4")))
dt$V2 <- sub("^([A-Z]+)[0-9]+/([A-Z]+)[0-9]+", "\\1\\2", dt$V1)
dt
V1 V2
1 C1/R3 CR
2 M2/R4 MR
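Note that this pattern keeps any uppercase letters and expects the exact letters/digits/slash layout, so it can differ from a plain "drop everything except C, M, R" substitution on other inputs. A small base R sketch of that difference follows; the extra inputs "A1/R3" and "C1" and the names v, by_sub and by_gsub are made up for illustration.

```r
# Sample values from the question plus two hypothetical edge cases.
v <- c("C1/R3", "M2/R4", "A1/R3", "C1")

# Capture-group approach: keeps the letter runs on either side of the slash.
by_sub <- sub("^([A-Z]+)[0-9]+/([A-Z]+)[0-9]+", "\\1\\2", v)

# Character-class approach: deletes everything that is not C, M or R.
by_gsub <- gsub("[^CMR]", "", v)

print(data.frame(v, by_sub, by_gsub))
# by_sub  is "CR" "MR" "AR" "C1"  (unmatched input is returned unchanged)
# by_gsub is "CR" "MR" "R"  "C"
```

If every value is known to follow the letter-digit/letter-digit layout with only C, M and R as letters, the two approaches give the same result.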
Upvotes: 1