vitor

Reputation: 1250

Concatenate strings using a common identifier

s1 <- "A*01 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC"
s2 <- "A*01 TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC"
s3 <- "A*01 TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"

How can I concatenate these strings into a single string, keeping the shared identifier "A*01" only once at the start?

Expected output:

sT <- "A*01 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"

Upvotes: 1

Views: 164

Answers (3)

hrbrmstr

Reputation: 78792

For a more generic solution, I'm assuming you have a file with many lines like the ones in the question, possibly with several different identifiers. If that's the case, the following groups the lines by identifier and joins the sequences within each group.

library(stringr)
library(plyr)

dat <- readLines(textConnection("A*01 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC
A*01 TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC
A*01 TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG
A*02 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC
A*02 TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC
A*02 TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG
A*03 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC
A*04 TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC
A*04 TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"))


# Split each line into its identifier and its sequence
dat.df <- data.frame(prefix=str_match(dat, "^(A\\*[0-9]+) ")[,2],
                     sequence=str_match(dat, " (.*)$")[,2], stringsAsFactors=FALSE)

# For each identifier, join all of its sequences and prepend the identifier once
res <- daply(dat.df, .(prefix), .fun=function(x) {
  paste(x$prefix[1], paste(x$sequence, collapse=" "))
})

names(res) <- NULL

print(res)

## [1] "A*01 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"
## [2] "A*02 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"
## [3] "A*03 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC"
## [4] "A*04 TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"
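
If you'd rather avoid the extra packages, the same grouping can probably be done in base R. This is only a sketch: the short dat vector is made-up sample data in the same "identifier, space, codons" layout, and it assumes the identifier is everything before the first space on each line.

```r
# Hypothetical input: same "ID codon codon ..." layout as the lines above.
dat <- c(
  "A*01 ATG GCC",
  "A*01 TCC CAC",
  "A*02 ATG GCC",
  "A*03 TTC GTG",
  "A*02 TAC GTG"
)

# Split each line at the first space: identifier on the left, sequence on the right.
prefix   <- sub(" .*$", "", dat)
sequence <- sub("^[^ ]+ ", "", dat)

# Keep identifiers in order of first appearance, then join each group's sequences.
groups <- split(sequence, factor(prefix, levels = unique(prefix)))
res <- vapply(names(groups),
              function(id) paste(id, paste(groups[[id]], collapse = " ")),
              character(1), USE.NAMES = FALSE)

print(res)
```

Because the identifier is taken as "whatever precedes the first space", this does not depend on the A*NN pattern, and lines for the same identifier do not need to be adjacent in the input.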

Upvotes: 1

Jilber Urbina

Reputation: 61154

Try this:

> concat <- paste(s1, sub("A[*]01 ", "", s2), sub("A[*]01 ", "", s3))
> identical(sT, concat)
[1] TRUE

concat looks like this:

> concat
[1] "A*01 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"

Upvotes: 1

hrbrmstr

Reputation: 78792

gsub(" A\\*01 ", " ", paste(s1, s2, s3, sep=" ", collapse=""))

will do what you want in this case, but I suspect you'll need a more generic solution in the long run.
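
As one possible step toward something more generic, here is a sketch of a small helper that reads the identifier from the first string instead of hard-coding "A*01" in a regex. The name merge_by_id and the sample vector s are made up for illustration, and the helper assumes every string starts with the same identifier followed by a single space.

```r
# Hypothetical helper (not from the answers above): merge strings that share a leading ID.
merge_by_id <- function(x) {
  id <- sub(" .*$", "", x[1])                      # identifier = text before the first space
  stopifnot(all(startsWith(x, paste0(id, " "))))   # every string must carry the same ID
  body <- substring(x, nchar(id) + 2)              # drop "ID " from the front of each string
  paste(id, paste(body, collapse = " "))
}

s <- c("A*01 ATG GCC", "A*01 TCC CAC", "A*01 TAC GTG")
merged <- merge_by_id(s)
print(merged)
```

Working with string positions rather than a pattern means the asterisk in the identifier never has to be escaped.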

Upvotes: 1
