Reputation: 2545
I have a set of strings, each preceded by an ID line that starts with >. I would like the string that follows each ID to be on a single line, rather than split across several lines as it is now. A string can be split across 1, 2, or 3 lines.
fileName <- "hairpin"
conn <- file(fileName, open = "r")
linn <- readLines(conn)
for (i in 1:length(linn)) {
  print(linn[i])
}
close(conn)
head(linn)
[1] ">cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop"
[2] "UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAAC"
[3] "UAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA"
[4] ">cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop"
[5] "AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCU"
[6] "GGGCUCUCCGGGUACCAGGACGGUUUGAGCAGAU"
Desired output:
[1] ">cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop" "UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAACUAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA"
[4] ">cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop" "AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCUGGGCUCUCCGGGUACCAGGACGGUUUGAGCAGAU"
I found the solution on another website:
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa
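For comparison, the logic of that awk one-liner can be sketched in R: a line starting with > opens a new record, and every other line is appended to the sequence of the most recent record. The function name join_fasta_lines and the toy input below are made up for illustration, and unlike the awk version this sketch silently drops any lines that appear before the first header and does not emit a leading blank line.

```r
# Hypothetical helper mirroring the awk logic: header lines are kept as-is,
# every other line is appended to the sequence of the most recent header.
join_fasta_lines <- function(lines) {
  out <- character(0)
  for (line in lines) {
    if (startsWith(line, ">")) {
      # New header, plus an empty slot that will collect its sequence.
      out <- c(out, line, "")
    } else if (length(out) > 0) {
      # Continuation line: glue it onto the current sequence.
      out[length(out)] <- paste0(out[length(out)], line)
    }
  }
  out
}

# Toy input (not from the original file), with a sequence split over 3 lines.
toy <- c(">id1 first record", "ACGU", "GGCC", "A",
         ">id2 second record", "UUUU")
joined <- join_fasta_lines(toy)
print(joined)
```

Growing a vector inside a loop is fine for small files, but it gets slow for large ones, where a vectorised approach is usually preferable.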
Upvotes: 1
Views: 97
Reputation: 269644
Try this:
g <- cumsum(grepl("^>", Lines)) # equals 1 for first group, 2 for second, etc.
unname(unlist(tapply(Lines, g, function(x) c(x[1], paste(x[-1], collapse = "")))))
giving:
[1] ">cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop"
[2] "UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAACUAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA"
[3] ">cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop"
[4] "AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCUGGGCUCUCCGGGUACCAGGACGGUUUGAGCAGAU"
Note: the input Lines is:
Lines <- c(">cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop",
"UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAAC",
"UAUGCAAUUUUCUACCUUACCGGAGACAGAACUCUUCGA",
">cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop",
"AUGCUUCCGGCCUGUUCCCUGAGACCUCAAGUGUGAGUGUACUAUUGAUGCUUCACACCU",
"GGGCUCUCCGGGUACCAGGACGGUUUGAGCAGAU")
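To make the grouping step easier to follow, here is a small sketch that prints the intermediate values and then builds a named vector (header as name, joined sequence as value) instead of an interleaved one. The toy input and the variable names toy, is_header, and seqs are made up for illustration, and the last step assumes every header is followed by at least one sequence line.

```r
# Toy input (not from the original file).
toy <- c(">a desc", "AAAA", "CC",
         ">b desc", "GGGG", "UU", "A")

# TRUE on header lines, FALSE on sequence lines.
is_header <- grepl("^>", toy)

# Running count of headers seen so far: each line gets its record number.
g <- cumsum(is_header)
print(g)

# Split the sequence lines by record and paste each record's pieces together.
seqs <- vapply(split(toy[!is_header], g[!is_header]),
               paste, character(1), collapse = "")

# Assumes one non-empty sequence per header, so the lengths match.
names(seqs) <- toy[is_header]
print(seqs)
```

A named vector like this can be convenient when the sequences need to be looked up by ID later, whereas the interleaved form is closer to what would be written back to a file.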
Upvotes: 1