Reputation: 164
I have a FASTA file containing protein sequences like the two shown below, and I would like to count how many times the amino acid A appears in each sequence.
>sp|P01920|DQB1_HUMAN HLA class II histocompatibility antigen, DQ beta 1 chain OS=Homo sapiens OX=9606 GN=HLA-DQB1 PE=1 SV=2
MSWKKALRIPGGLRAATVTLMLAMLSTPVAEGRDSPEDFVYQFKAMCYFTNGTERVRYVT
RYIYNREEYARFDSDVEVYRAVTPLGPPDAEYWNSQKEVLERTRAELDTVCRHNYQLELR
TTLQRRVEPTVTISPSRTEALNHHNLLVCSVTDFYPAQIKVRWFRNDQEETTGVVSTPLI
RNGDWTFQILVMLEMTPQHGDVYTCHVEHPSLQNPITVEWRAQSESAQSKMLSGIGGFVL
GLIFLGLGLIIHHRSQKGLLH
>sp|P18440|ARY1_HUMAN Arylamine N-acetyltransferase 1 OS=Homo sapiens OX=9606 GN=NAT1 PE=1 SV=2
MDIEAYLERIGYKKSRNKLDLETLTDILQHQIRAVPFENLNIHCGDAMDLGLEAIFDQVV
RRNRGGWCLQVNHLLYWALTTIGFETTMLGGYVYSTPAKKYSTGMIHLLLQVTIDGRNYI
VDAGFGRSYQMWQPLELISGKDQPQVPCVFRLTEENGFWYLDQIRREQYIPNEEFLHSDL
LEDSKYRKIYSFTLKPRTIEDFESMNTYLQTSPSSVFTSKSFCSLQTPDGVHCLVGFTLT
HRRFNYKDNTDLIEFKTLSEEEIEKVLKNIFNISLQRKLVPKHGDRFFTI
This code
library(seqinr)
data <- read.fasta(file = "yourlist.fasta", as.string = TRUE)
library(stringr)
ACount <- stri_count_regex("A",data)
gives the result shown in the picture.
Although the character A exists in both sequences, it is not counted. Any ideas why this is happening? Thank you for your interest.
Upvotes: 0
Views: 1398
Reputation: 41
There seem to be some mistakes in your code. Following your procedure, this worked fine for me:
library(seqinr)
data <- read.fasta(file = "yourlist.fasta", seqtype = "AA", as.string = TRUE, set.attributes = FALSE)
library(stringi)
ACount <- stri_count_regex(data, "A")
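You have to specify the type of sequence with the seqtype argument. The default is "DNA", so you have to change it to "AA" (protein). With the default, read.fasta also converts the sequences to lower case (forceDNAtolower = TRUE), so an upper-case "A" is never matched.
The stri_count_regex function is part of the stringi package (not stringr, and not base R), so load stringi. It takes the string as its first argument and the pattern as its second, the reverse of the order in your call.
I now get: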
> str(ACount)
int [1:2] 14 7
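To illustrate why the original call returned zeros, here is a minimal base R sketch that needs neither seqinr nor stringi. The two short sequences are truncated from the records in the question for illustration only, and count_fixed is a hypothetical helper, not a function from any package. It shows the two separate problems: lower-cased sequences and swapped arguments.

```r
# Toy sequences (assumption: the first residues of the two records in the question)
seqs <- c("MSWKKALRIPGGLRAATVTL", "MDIEAYLERIGYKK")

# Hypothetical helper: count literal occurrences of `pattern` in each element of `text`
count_fixed <- function(text, pattern) {
  lengths(regmatches(text, gregexpr(pattern, text, fixed = TRUE)))
}

# Correct order: sequences are the text, "A" is the pattern
count_upper <- count_fixed(seqs, "A")

# Lower-cased sequences (what seqtype = "DNA" produces by default): "A" no longer matches
count_lower <- count_fixed(tolower(seqs), "A")

# Swapped arguments: the whole sequence is searched for inside the string "A"
count_swapped <- count_fixed("A", seqs[1])

print(count_upper)    # 3 1
print(count_lower)    # 0 0
print(count_swapped)  # 0
```

Either problem on its own is enough to give a count of zero, which is why both the seqtype and the argument order need fixing.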
Upvotes: 0
Reputation: 3
I'm not sure this will work for your data, but here is an idea. You can count occurrences of a pattern in a string with str_count from the stringr package. Here's some info: https://stringr.tidyverse.org/reference/str_detect.html
I made a short example with your second sequence above.
dna<- "MDIEAYLERIGYKKSRNKLDLETLTDILQHQIRAVPFENLNIHCGDAMDLGLEAIFDQVVRRNRGGWCLQVNHLLYWALTTIGFETTMLGGYVYSTPAKKYSTGMIHLLLQVTIDGRNYIVDAGFGRSYQMWQPLELISGKDQPQVPCVFRLTEENGFWYLDQIRREQYIPNEEFLHSDLLEDSKYRKIYSFTLKPRTIEDFESMNTYLQTSPSSVFTSKSFCSLQTPDGVHCLVGFTLTHRRFNYKDNTDLIEFKTLSEEEIEKVLKNIFNISLQRKLVPKHGDRFFTI"
str_count(string= dna, pattern= "VGFTL")
#1
Alternatively, I found the sequences package online, but it only counts the 'A', 'C', 'G' and 'T' bases, so it will not count an arbitrary string. Here is the CRAN manual in case you want to take a look: https://cran.r-project.org/web/packages/sequences/sequences.pdf
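If a per-residue count for every amino acid is wanted rather than only the four DNA bases, a base R sketch along these lines may be enough. It assumes the sequence is already available as a single upper-case string; the short sequence below is a truncated, illustrative fragment of the second record in the question.

```r
# Toy sequence (assumption: first 14 residues of the second record in the question)
seq1 <- "MDIEAYLERIGYKK"

# Split the string into single characters, one per residue
residues <- strsplit(seq1, "")[[1]]

# Tabulate how often each residue occurs
composition <- table(residues)
print(composition)

# Count one residue of interest; sum() returns 0 if the residue is absent
a_count <- sum(residues == "A")
print(a_count)  # 1
```

Using sum(residues == "A") rather than indexing the table avoids an error when the residue does not occur in the sequence at all.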
Upvotes: 0