Reputation: 1250
s1 <- "A*01 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC"
s2 <- "A*01 TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC"
s3 <- "A*01 TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"
How can I concatenate these strings so that the shared identifier "A*01" appears only once, at the start of the result?
Expected output:
sT <- "A*01 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"
Upvotes: 1
Views: 164
Reputation: 78792
For a more generic solution, I'm assuming you have a file with many lines like the ones in the question, possibly with several different identifiers. If that's the case, the following should give you what you need.
library(stringr)
library(plyr)
dat <- readLines(textConnection("A*01 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC
A*01 TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC
A*01 TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG
A*02 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC
A*02 TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC
A*02 TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG
A*03 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC
A*04 TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC
A*04 TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"))
# Split each line into its identifier and the sequence that follows it.
dat.df <- data.frame(prefix=str_match(dat, "^(A\\*[0-9]+) ")[,2],
                     sequence=str_match(dat, " (.*)$")[,2], stringsAsFactors=FALSE)
# For each identifier, join its sequences and put the identifier back in front.
res <- daply(dat.df, .(prefix), .fun=function(x) {
  paste(x$prefix[1], paste(x$sequence, collapse=" "))
})
names(res) <- NULL
print(res)
## [1] "A*01 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"
## [2] "A*02 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"
## [3] "A*03 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC"
## [4] "A*04 TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"
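If you would rather not depend on stringr and plyr, the same grouping can be sketched in base R. This is only a sketch under one assumption: the identifier is everything before the first space on each line. The `lines` vector below is shortened, made-up data standing in for the output of readLines().

```r
# Shortened example input; in practice this would come from readLines().
lines <- c("A*01 ATG GCC", "A*01 TCC CAC", "A*02 ATG GCC", "A*03 TTC")

# Identifier = everything before the first space; sequence = everything after it.
prefix   <- sub(" .*$", "", lines)
sequence <- sub("^[^ ]+ ", "", lines)

# Group the sequences by identifier. The explicit factor levels keep the
# groups in order of first appearance rather than sorted order.
groups <- split(sequence, factor(prefix, levels = unique(prefix)))

# Rebuild one string per identifier: "<id> <seq1> <seq2> ...".
res <- unname(mapply(function(id, seqs) paste(id, paste(seqs, collapse = " ")),
                     names(groups), groups))
print(res)
```

With the sample input this gives three strings, one per identifier, with the two A*01 lines merged into "A*01 ATG GCC TCC CAC".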
Upvotes: 1
Reputation: 61154
Try this
> concat <- paste(s1, sub("A[*]01 ", "", s2), sub("A[*]01 ", "", s3))
> identical(sT, concat)
[1] TRUE
concat looks like this:
> concat
[1] "A*01 ATG GCC GTC ATG GCG CCC CGA ACC CTC CTC CTG CTA CTC TCG GGG GCC CTG GCC TCC CAC TCC ATG AGG TAT TTC TTC ACA TCC GTG TCC CCC GGC CGC GGG GAG CCC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AAG"
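If you end up with more than three strings, the same idea can be wrapped in a small function. This is a rough sketch, not a polished solution: `join_with_id` is a made-up name, and it assumes every string should contribute its text after the identifier and a single space. It compares the prefix literally instead of using a regular expression, so the "*" in the identifier needs no escaping.

```r
# Hypothetical helper: join strings that share the identifier `id`,
# keeping the identifier only once at the front.
join_with_id <- function(x, id) {
  tag <- paste0(id, " ")
  # Drop the leading "<id> " where present; leave other strings untouched.
  stripped <- ifelse(startsWith(x, tag), substring(x, nchar(tag) + 1), x)
  paste(id, paste(stripped, collapse = " "))
}

# Shortened example strings in the same shape as s1, s2, s3.
parts <- c("A*01 ATG GCC", "A*01 TCC CAC", "A*01 TAC GTG")
joined <- join_with_id(parts, "A*01")
print(joined)
```

For the real data you would call it as join_with_id(c(s1, s2, s3), "A*01").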
Upvotes: 1
Reputation: 78792
gsub(" A\\*01 ", " ", paste(s1, s2, s3, sep=" ", collapse=""))
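will do what you want in this case: the leading space in the pattern means the "A*01" at the very start of the pasted string is not matched, so only the later copies are removed. I suspect you will need a more generic solution in the long run, though.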
Upvotes: 1