How to count amino acids in a FASTA file with multiple protein sequences, using R

Question

Having a fasta file containing protein sequences like these two showing below, I would like to count how many times the amino acid A appears in each sequence.

>sp|P01920|DQB1_HUMAN HLA class II histocompatibility antigen, DQ beta 1 chain OS=Homo sapiens OX=9606 GN=HLA-DQB1 PE=1 SV=2
MSWKKALRIPGGLRAATVTLMLAMLSTPVAEGRDSPEDFVYQFKAMCYFTNGTERVRYVT
RYIYNREEYARFDSDVEVYRAVTPLGPPDAEYWNSQKEVLERTRAELDTVCRHNYQLELR
TTLQRRVEPTVTISPSRTEALNHHNLLVCSVTDFYPAQIKVRWFRNDQEETTGVVSTPLI
RNGDWTFQILVMLEMTPQHGDVYTCHVEHPSLQNPITVEWRAQSESAQSKMLSGIGGFVL
GLIFLGLGLIIHHRSQKGLLH

>sp|P18440|ARY1_HUMAN Arylamine N-acetyltransferase 1 OS=Homo sapiens OX=9606 GN=NAT1 PE=1 SV=2
MDIEAYLERIGYKKSRNKLDLETLTDILQHQIRAVPFENLNIHCGDAMDLGLEAIFDQVV
RRNRGGWCLQVNHLLYWALTTIGFETTMLGGYVYSTPAKKYSTGMIHLLLQVTIDGRNYI
VDAGFGRSYQMWQPLELISGKDQPQVPCVFRLTEENGFWYLDQIRREQYIPNEEFLHSDL
LEDSKYRKIYSFTLKPRTIEDFESMNTYLQTSPSSVFTSKSFCSLQTPDGVHCLVGFTLT
HRRFNYKDNTDLIEFKTLSEEEIEKVLKNIFNISLQRKLVPKHGDRFFTI

This code

library(seqinr)
data <- read.fasta(file = "yourlist.fasta", as.string = TRUE)
library(stringr)
ACount <- stri_count_regex("A",data)

gives the result showing on the picture.

Although the character A excists in both sequences they are not counted. Any ideas on why is this happening? Thank you for your interest.

dqt · Accepted Answer

There seem to be some mistakes on your code. Following your procedure, this worked fine by me:

library(seqinr)
data <- read.fasta(file = "yourlist.fasta", seqtype = "AA", as.string = TRUE, set.attributes = FALSE)

library(stringi)
ACount <- stri_count_regex(data, "A")

You have to specify with the seqtype argument the type of sequence, being "DNA" the default. You have to change it to "AA" (protein). The stri_count_regex function is part of the stringi base R package.

I get now:

> str(ACount)
 int [1:2] 14 7

How to count amino acids in a FASTA file with multiple protein sequences, using R

Answers (2)

Related Questions