Reputation: 1430
I've got a string column (in a data.table) that I need to parse. Each string is made up of pieces separated by '-', and I need to pull out the piece at a given position, where the position is defined in advance but can vary. I'm not sure how to do this with a regex:
> test <- c("AAA-bb-ccc", "abcd-efgh","blah", "blah-blah-blah-blah")
Say the predefined position is i.
> i = 1
> output
"AAA" "abcd" "blah" "blah"
> i = 2
> output
"bb" "efgh" "" "blah"
> i = 3
> output
"ccc" "" "" "blah"
How would I write a general regex, parameterised by i, that achieves this?
Upvotes: 1
Views: 72
Reputation: 378
We can also use tokenize_regex from the tokenizers package, then data.table::transpose the result and cbind the relevant columns into a data.table:
test <- c("AAA-bb-ccc", "abcd-efgh","blah", "blah-blah-blah-blah")
library(tokenizers)
library(data.table)
test <- transpose(tokenize_regex(test, "-"), fill = "")
i <- 1:3
as.data.table(do.call(cbind, test[i]))
#     V1   V2   V3
#1:  AAA   bb  ccc
#2: abcd efgh
#3: blah
#4: blah blah blah
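If only a single position is needed and the strings already live in a data.table column, data.table's own tstrsplit might be enough without the extra package. This is only a sketch: the column names `txt` and `part` are made up for illustration, and `keep = i` is expected to fail if `i` is larger than the longest split in the column.

```r
library(data.table)

# Hypothetical table; `txt` is an assumed column name.
dt <- data.table(txt = c("AAA-bb-ccc", "abcd-efgh", "blah", "blah-blah-blah-blah"))
i <- 2

# Split on a literal "-", pad short rows with "", and keep only position i.
dt[, part := tstrsplit(txt, "-", fixed = TRUE, fill = "", keep = i)]
dt$part
```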
Upvotes: 1
Reputation: 13125
For i = 3, you can try:
unlist(lapply(strsplit(test,split = '-'),'[',3))
[1] "ccc" NA NA "blah"
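Note that this returns NA where the question asks for "". A possible variant, sketched here with `i` as a variable rather than a hard-coded 3, replaces the NA values afterwards; the name `out` is just for illustration.

```r
test <- c("AAA-bb-ccc", "abcd-efgh", "blah", "blah-blah-blah-blah")
i <- 3

# Indexing past the end of a split vector gives NA, so short strings come back as NA.
out <- vapply(strsplit(test, "-", fixed = TRUE), `[`, character(1), i)

# Swap the NA placeholders for empty strings to match the requested output.
out[is.na(out)] <- ""
out
```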
Upvotes: 1
Reputation: 388862
We can create a function which splits on "-" and returns the ith value.
get_i_th_element <- function(test, i) {
  sapply(strsplit(test, "-"), function(x) if (length(x) >= i) x[[i]] else "")
}
get_i_th_element(test, 1)
#[1] "AAA" "abcd" "blah" "blah"
get_i_th_element(test, 3)
#[1] "ccc" "" "" "blah"
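Since the question asks specifically for a regex, here is one way it could be done in base R. This is a sketch rather than a definitive solution: `extract_field_regex` is a made-up name, and the pattern assumes the separator is always a single literal "-".

```r
# Hypothetical helper: extract the i-th "-"-delimited piece using a regex.
extract_field_regex <- function(x, i) {
  # Skip (i - 1) "piece-" prefixes, then capture the next run of non-dash characters.
  pattern <- sprintf("^(?:[^-]*-){%d}([^-]*)", i - 1L)
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Strings with fewer than i pieces do not match and yield character(0); map those to "".
  vapply(m, function(hit) if (length(hit) == 2L) hit[2] else "", character(1))
}

test <- c("AAA-bb-ccc", "abcd-efgh", "blah", "blah-blah-blah-blah")
extract_field_regex(test, 2)
```

The pattern is built with sprintf so that the repetition count follows i; perl = TRUE is used because of the non-capturing group.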
Upvotes: 1