BioChemoinformatics

Reputation: 715

In R: efficiently convert one format (character vector) to another format (numeric matrix)

Using a piece of software, I can calculate molecular fingerprints, which look like this:

>L
[1]    "1  1:1 2:1 3:1 5:1 6:1 8:1"
[2]    "5  1:1 2:1 4:1"
[3]    "9  1:1 2:1 7:1 10:1"

The first value on each line (1, 5, 9) is the molecule name, and the rest of the line is the corresponding fingerprint, which has a fixed length, say 10. In each pair, the number to the left of ":" is the bit position and the number to the right is the bit value: 1 means the bit is set, and positions whose value is 0 are omitted. I would like to restore the full format, in which each of the 10 positions has an explicit value.

L should look like this, so that I can save it in CSV format:

mol 1 2 3 4 5 6 7 8 9 10
1   1 1 1 0 1 1 0 1 0 0
5   1 1 0 1 0 0 0 0 0 0
9   1 1 0 0 0 0 1 0 0 1

L has on the order of a million rows. What is an efficient way to convert it to the desired format?

Thanks.

Upvotes: 4

Views: 136

Answers (3)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193597

Update

To avoid read.csv, just use strsplit together with the non-exported function splitstackshape:::numMat:

M <- strsplit(L, "\\s+|:")
cbind(mol = as.numeric(sapply(M, `[`, 1)),
    splitstackshape:::numMat(lapply(M, `[`, -1), fill=0))
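
Because numMat is not exported, it may change or disappear between package versions. As a hedged alternative, here is a minimal base R sketch of the same idea using matrix indexing. It is not the splitstackshape implementation. The name n_bits is introduced here for illustration, and its value of 10 is the fixed fingerprint length assumed in the question.

```r
# Sample input from the question
L <- c("1  1:1 2:1 3:1 5:1 6:1 8:1",
       "5  1:1 2:1 4:1",
       "9  1:1 2:1 7:1 10:1")

n_bits <- 10  # assumption: fixed fingerprint length, as stated in the question

# Split each line into tokens: name, pos, val, pos, val, ...
tok <- strsplit(L, "\\s+|:")

# The molecule name is the first token of every line
mol <- as.numeric(vapply(tok, `[`, character(1), 1))

# Number of position:value pairs on each line
n_pairs <- (lengths(tok) - 1L) / 2L

# Flatten all pairs into a 2-row matrix: row 1 = positions, row 2 = values
pairs <- matrix(as.numeric(unlist(lapply(tok, `[`, -1))), nrow = 2)

# Start from an all-zero matrix and fill it in one vectorised assignment,
# indexing with a two-column (row, column) matrix
fp <- matrix(0, nrow = length(L), ncol = n_bits,
             dimnames = list(NULL, as.character(seq_len(n_bits))))
fp[cbind(rep(seq_along(L), n_pairs), pairs[1, ])] <- pairs[2, ]

res <- cbind(mol = mol, fp)
res
```

The result is a numeric matrix with a mol column followed by one column per bit position, so it could be written out with something like write.csv(res, "fingerprints.csv", row.names = FALSE), where the file name is only a placeholder.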

Update 2: Benchmarks

For the curious....

The sample data:

L <- c("1  1:1 2:1 3:1 5:1 6:1 8:1",
       "5  1:1 2:1 4:1",
       "9  1:1 2:1 7:1 10:1")
M <- replicate(10000, L)

@thelatemail's answer:

fun1 <- function() {
  spl <- lapply(strsplit(M,"\\s+|:.? |:.$"),as.numeric)
  vals <- lapply(spl,"[",-1)

  data.frame(
    mol=sapply(spl,"[",1),
    t(sapply(vals, function(x) {
      out <- rep(0,max(unlist(vals)))
      out[x] <- 1
      out} ))
  )
} 

system.time(out_late <- fun1())
#    user  system elapsed 
#   98.36    1.28  100.06
head(out_late)
#   mol X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1   1  1  1  1  0  1  1  0  1  0   0
# 2   5  1  1  0  1  0  0  0  0  0   0
# 3   9  1  1  0  0  0  0  1  0  0   1
# 4   1  1  1  1  0  1  1  0  1  0   0
# 5   5  1  1  0  1  0  0  0  0  0   0
# 6   9  1  1  0  0  0  0  1  0  0   1

My updated answer:

library(splitstackshape)
fun2 <- function() {
  M <- strsplit(M, "\\s+|:")
  cbind(mol = as.numeric(sapply(M, `[`, 1)),
        splitstackshape:::numMat(lapply(M, `[`, -1), fill=0))
}

system.time(out_ananda <- fun2())
#    user  system elapsed 
#    0.67    0.00    0.68
head(out_ananda)
#      mol 1 2 3 4 5 6 7 8 9 10
# [1,]   1 1 1 1 0 1 1 0 1 0  0
# [2,]   5 1 1 0 1 0 0 0 0 0  0
# [3,]   9 1 1 0 0 0 0 1 0 0  1
# [4,]   1 1 1 1 0 1 1 0 1 0  0
# [5,]   5 1 1 0 1 0 0 0 0 0  0
# [6,]   9 1 1 0 0 0 0 1 0 0  1

@Matthew's answer. Note that this would need to be modified to accept different "val" values.

fun3 <- function() {
  t(sapply(strsplit(M, "\\s+"), function(l) {
    mol <- as.numeric(l[1])
    names(mol) <- 'mol'
    val <- numeric(10)
    names(val) <- 1:10
    for (x in strsplit(l[-1], ":"))
      val[x[1]] <- as.numeric(x[2])
    c(mol, val)
  }))
}

system.time(out_matthew <- fun3())
#    user  system elapsed 
#    2.33    0.00    2.34
head(out_matthew)
#      mol 1 2 3 4 5 6 7 8 9 10
# [1,]   1 1 1 1 0 1 1 0 1 0  0
# [2,]   5 1 1 0 1 0 0 0 0 0  0
# [3,]   9 1 1 0 0 0 0 1 0 0  1
# [4,]   1 1 1 1 0 1 1 0 1 0  0
# [5,]   5 1 1 0 1 0 0 0 0 0  0
# [6,]   9 1 1 0 0 0 0 1 0 0  1

Upvotes: 2

Matthew Lundberg

Reputation: 42669

Borrowing from thelatemail, here's an expression which returns a matrix of the proper elements. Rather than setting the value to 1, I set the value to whatever follows the : character in a for loop. Then the whole thing is transposed to give the format that you desire.

t(sapply(strsplit(L, "\\s+"), function(l) {
  # Each line is passed in as a vector, the first element is "mol"
  mol <- as.numeric(l[1])
  names(mol) <- 'mol'

  # Store the values in a vector of length 10, with names
  val <- numeric(10)
  names(val) <- 1:10

  # Split the tail of the input vector on ":" and assign to the proper slot of the output vector
  for (x in strsplit(l[-1], ":"))
     val[x[1]] <- as.numeric(x[2])

  # Put them back together
  c(mol, val)
}))

##      mol 1 2 3 4 5 6 7 8 9 10
## [1,]   1 1 1 1 0 1 1 0 1 0  0
## [2,]   5 1 1 0 1 0 0 0 0 0  0
## [3,]   9 1 1 0 0 0 0 1 0 0  1
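
The fingerprint length of 10 is hard-coded in the expression above. As a rough sketch of how it could be made a parameter, the same per-line logic can be wrapped in a function. The names fingerprint_matrix and n_bits are hypothetical and chosen only for this illustration; vapply is used instead of sapply so that every row is checked to have the expected length.

```r
# Hypothetical helper: same per-line approach, with the length as an argument
fingerprint_matrix <- function(lines, n_bits) {
  rows <- vapply(strsplit(lines, "\\s+"), function(l) {
    val <- numeric(n_bits)                 # all bits default to 0
    if (length(l) > 1) {
      # One row per "position:value" token; column 1 = position, column 2 = value
      kv <- do.call(rbind, strsplit(l[-1], ":"))
      val[as.integer(kv[, 1])] <- as.numeric(kv[, 2])
    }
    c(as.numeric(l[1]), val)               # molecule name first, then the bits
  }, numeric(n_bits + 1))

  rows <- t(rows)                          # vapply returns one column per line
  colnames(rows) <- c("mol", seq_len(n_bits))
  rows
}

L <- c("1  1:1 2:1 3:1 5:1 6:1 8:1",
       "5  1:1 2:1 4:1",
       "9  1:1 2:1 7:1 10:1")
m <- fingerprint_matrix(L, n_bits = 10)
m
```

As in the original expression, the value stored is whatever follows the ":" character, so bit values other than 1 are kept.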

Upvotes: 2

thelatemail

Reputation: 93908

An attempt using base R functions, assuming L is the same as used by @Ananda.

spl <- lapply(strsplit(L,"\\s+|:.? |:.$"),as.numeric)
vals <- lapply(spl,"[",-1)

data.frame(
  mol=sapply(spl,"[",1),
  t(sapply(vals, function(x) {
   out <- rep(0,max(unlist(vals)))
   out[x] <- 1
   out} ))
)

#  mol X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
#1   1  1  1  1  0  1  1  0  1  0   0
#2   5  1  1  0  1  0  0  0  0  0   0
#3   9  1  1  0  0  0  0  1  0  0   1
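
Note that max(unlist(vals)) sits inside the anonymous function, so it is re-evaluated for every row, which is likely to become expensive on large inputs. A hedged sketch of the same approach with that value computed once follows. The variable n_bits is a name introduced here, and the split pattern is unchanged, so it still assumes single-character bit values.

```r
L <- c("1  1:1 2:1 3:1 5:1 6:1 8:1",
       "5  1:1 2:1 4:1",
       "9  1:1 2:1 7:1 10:1")

# Same split as above: keeps the molecule name and the bit positions only
spl  <- lapply(strsplit(L, "\\s+|:.? |:.$"), as.numeric)
vals <- lapply(spl, `[`, -1)

# Compute the fingerprint length once, outside the per-row function
n_bits <- max(unlist(vals))

# Build one 0/1 vector per line; vapply returns them as columns, hence t()
bits <- t(vapply(vals, function(x) {
  out <- numeric(n_bits)
  out[x] <- 1
  out
}, numeric(n_bits)))

res <- data.frame(mol = vapply(spl, `[`, numeric(1), 1), bits)
res
```

The output has the same shape as before: a mol column followed by X1 to X10.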

Upvotes: 2
