How to efficiently match and combine strings in data.table

Question

Consider a sample dataset:

dt <- data.table(data.frame(V1 = c("C1/R3","M2/R4")))
> dt
      V1
1: C1/R3
2: M2/R4

For each row of dt, I want to extract the characters C, M, or R and concatenate them. For example:

dt[, V2 := stri_join_list(str_match_all(V1, "[CMR]"), sep = "", collapse = ""), by = seq_len(nrow(dt))]
> dt
         V1 V2
1:    C1/R3 CR
2:    M2/R4 MR

However, I have 42 million rows and the above code is far too slow. Is there a way to do this without row-wise operations? When I omit the by argument, every row gets the value CRMR.

Tim Biegeleisen · Accepted Answer

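One option uses sub with two capture groups. sub is vectorized, so it processes the whole column in one call and needs no by argument: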

dt <- data.table(data.frame(V1 = c("C1/R3","M2/R4")))
# Backreferences must be double-escaped inside an R string literal
dt$V2 <- sub("^([A-Z]+)[0-9]+/([A-Z]+)[0-9]+", "\\1\\2", dt$V1)
dt
     V1 V2
1 C1/R3 CR
2 M2/R4 MR
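
If the values do not always follow the letters/digits/slash layout that the pattern above expects, a possible alternative is to delete every character that is not C, M, or R. This is only a sketch in base R, not part of the original answer. It assumes that C, M, and R are the only characters to keep, and the names v1 and v2 are just illustrative:

```r
# Sample values from the question, held in a plain character vector
v1 <- c("C1/R3", "M2/R4")

# gsub is vectorized: one call strips every character outside the set C, M, R
v2 <- gsub("[^CMR]", "", v1)

print(v2)
# [1] "CR" "MR"
```

On the data.table itself the same idea would be written as dt[, V2 := gsub("[^CMR]", "", V1)], which again avoids grouping by row.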
