How to efficiently match and combine strings in data.table

Consider a sample dataset:

library(data.table)
dt <- data.table(data.frame(V1 = c("C1/R3","M2/R4")))
> dt
      V1
1: C1/R3
2: M2/R4

For each row of dt, I want to extract the characters C, M, or R and concatenate them. For example:

library(stringi); library(stringr)
dt[, V2 := stri_join_list(str_match_all(V1, "[CMR]"), sep = "", collapse = ""), by = seq_len(nrow(dt))]
> dt
         V1 V2
1:    C1/R3 CR
2:    M2/R4 MR

However, I have 42 million rows, and the above code is not nearly efficient enough. Is there a way to do this without row-wise operations? When I omit the by argument, every row gets the value CRMR.

Upvotes: 0


Answers (2)

krads

Reputation: 1369

If, as you stated, you only wish to capture the letters C, M and R into a new column in your data.table, then the following should work efficiently by assigning in place:

dt[, V2 := gsub('[^CMR]', '', V1, perl=TRUE, useBytes=TRUE)]

The pattern [^CMR] matches any character that is not C, M, or R; gsub then replaces each such character with the empty string ''.
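As a quick check of that pattern, here is a minimal sketch on the two sample strings from the question. It uses only base R, so it runs without data.table; the names x and res are just for illustration.

```r
# Sample values from the question (x is an illustrative name)
x <- c("C1/R3", "M2/R4")

# Delete every character that is not C, M, or R
res <- gsub("[^CMR]", "", x)
print(res)
# [1] "CR" "MR"
```

Because gsub is vectorized over its input, the same call works on a whole column at once, with no by argument needed.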

Per the help from ?gsub: "If you can make use of useBytes = TRUE, the strings will not be checked before matching, and the actual matching will be faster."

Finally, from what I have read, I believe perl=TRUE should be faster than omitting it. However, could you test both variants on your real data and report the results to confirm?
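One way to run that comparison is sketched below. It assumes synthetic data shaped like the question's sample (one letter, one digit, a slash, one letter, one digit). The names n, x, and the timing variables, as well as the size of 1e5 strings, are illustrative choices rather than anything from the question, and timings will vary by machine and R version.

```r
set.seed(1)
n <- 1e5  # illustrative size; increase towards the real row count

# Synthetic strings such as "C1/R3" (assumed shape, matching the sample data)
x <- paste0(sample(c("C", "M", "R"), n, replace = TRUE),
            sample(0:9, n, replace = TRUE),
            "/",
            sample(c("C", "M", "R"), n, replace = TRUE),
            sample(0:9, n, replace = TRUE))

# Time each variant of the same substitution
t_default <- system.time(r_default <- gsub("[^CMR]", "", x))
t_perl    <- system.time(r_perl    <- gsub("[^CMR]", "", x, perl = TRUE))
t_bytes   <- system.time(r_bytes   <- gsub("[^CMR]", "", x, perl = TRUE, useBytes = TRUE))

# On ASCII input like this, all three variants should return the same strings
stopifnot(identical(r_default, r_perl), identical(r_default, r_bytes))

# Elapsed seconds per variant
print(c(default    = t_default[["elapsed"]],
        perl       = t_perl[["elapsed"]],
        perl_bytes = t_bytes[["elapsed"]]))
```

A single system.time call is a rough measure; repeating each call a few times, or using a benchmarking package, would give steadier numbers.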

Upvotes: 0

Tim Biegeleisen

Reputation: 520898

One option uses sub:

dt <- data.table(data.frame(V1 = c("C1/R3","M2/R4")))
dt$V2 <- sub("^([A-Z]+)[0-9]+/([A-Z]+)[0-9]+", "\\1\\2", dt$V1)
dt
     V1 V2
1 C1/R3 CR
2 M2/R4 MR
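The sub approach relies on every value having the shape letters, digits, slash, letters, digits, so it can behave differently from a character-class filter such as gsub("[^CMR]", "", ...) on other input. The sketch below compares the two on a few hypothetical strings that are not from the question; the names x, by_sub, and by_gsub are illustrative.

```r
# Hypothetical inputs: the sample shape, an extra segment,
# a letter outside C/M/R, and a value with no slash
x <- c("C1/R3", "C1/R3/M2", "X1/R3", "C1")

# Shape-based extraction: keeps the two captured letter groups and
# leaves any text outside the match (or unmatched strings) untouched
by_sub <- sub("^([A-Z]+)[0-9]+/([A-Z]+)[0-9]+", "\\1\\2", x)

# Character-class filter: keeps only C, M, and R wherever they occur
by_gsub <- gsub("[^CMR]", "", x)

print(data.frame(x, by_sub, by_gsub))
```

When every value matches the sample shape and contains only the letters C, M, and R, both approaches give the same result; otherwise the sub version keeps any capital letter it captures, while the filter keeps only C, M, and R.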


Upvotes: 1
