J Ireland
J Ireland

Reputation: 21

How to efficiently extract delimited strings from a data table in R

I have a data table in R with text columns of colon delimited data. I want to return a matrix/data table of results where one of the delimited values is returned for each cell.

The code pasted below demonstrates the problem and is a working solution. However, my actual data table is large (a few thousand rows and columns), and the pasted method takes on the order of a minute or two to complete.

I'm wondering if there is a more efficient way to perform this task? It appears that the sep2 option in fread will be very useful for this problem once implemented.

Thanks!

> # Set up data.table
> DT <- data.table(A = c("cat:1:meow", "dog:2:bark", "cow:3:moo"),
                     B = c("dog:3:meow", "dog:4:bark", "frog:3:croak"),
                     C = c("dingo:0:moo", "cat:8:croak", "frog:1:moo"))
> print(DT)

        A            B           C
1: cat:1:meow   dog:3:meow dingo:0:moo
2: dog:2:bark   dog:4:bark cat:8:croak
3: cow:3:moo   frog:3:croak  frog:1:moo

# grab the second delimited value in each cell
> part_index <- 2
> f = function(x) {vapply(t(x), function(x) {unlist(strsplit(x, ":", fixed=T))[part_index]}, character(1))}

> sapply(DT, f)

    A   B   C  
[1,] "1" "3" "0"
[2,] "2" "4" "8"
[3,] "3" "3" "1"

Upvotes: 2

Views: 206

Answers (1)

G. Grothendieck
G. Grothendieck

Reputation: 269586

1) sub Try this:

DT[, lapply(.SD, sub, pattern = ".*:(.*):.*", replacement = "\\1")]

giving:

   A B C
1: 1 3 0
2: 2 4 8
3: 3 3 1

2) fread or using fread:

DT[, lapply(.SD, function(x) fread(paste(x, collapse = "\n"))$V2)]

3) matrix Note that similar code would work with plain character matrix without data.table:

m <- as.matrix(DT)

replace(m, TRUE, sub(".*:(.*):.*", "\\1", m))

giving:

     A   B   C  
[1,] "1" "3" "0"
[2,] "2" "4" "8"
[3,] "3" "3" "1"

3a) Even simpler (no regular expressions) would be:

replace(m, TRUE, read.table(text = m, sep = ":")$V2)

3b) or using fread from data.table:

replace(m, TRUE, fread(paste(m, collapse = "\n"))$V2)

Upvotes: 4

Related Questions