sheraz iqbal

Reputation: 1

How to read a FASTA file and count the letters in its sequences

I want to count how many times each letter (A...Z) occurs in the sequences in my file, but my code is not counting correctly.

from Bio import SeqIO
PSEQ=[repr(seq_record.seq)for seq_record in SeqIO.parse("data.txt","fasta")]
print(PSEQ)
print(len(PSEQ))
PSEQ_ID=[(seq_record.id)for seq_record in SeqIO.parse("data.txt","fasta")]
print(PSEQ_ID)
PSEQ_ID=([i for i in PSEQ_ID[0:]])
PSEQ=([i for i in PSEQ[0:]])
print(len(PSEQ[0:]))
A=[i.count("A")for i in PSEQ]
B=[i.count("B")for i in PSEQ]
C=[i.count("C")for i in PSEQ]
D=[i.count("D")for i in PSEQ]
E=[i.count("E")for i in PSEQ]
F=[i.count("F")for i in PSEQ]
G=[i.count("G")for i in PSEQ]
H=[i.count("H")for i in PSEQ]
I=[i.count("I")for i in PSEQ]
J=[i.count("J")for i in PSEQ]
K=[i.count("K")for i in PSEQ]
L=[i.count("L")for i in PSEQ]
M=[i.count("M")for i in PSEQ]
N=[i.count("N")for i in PSEQ]
O=[i.count("O")for i in PSEQ]
P=[i.count("P")for i in PSEQ]
Q=[i.count("Q")for i in PSEQ]
R=[i.count("R")for i in PSEQ]
S=[i.count("S")for i in PSEQ]
T=[i.count("T")for i in PSEQ]
U=[i.count("U")for i in PSEQ]
V=[i.count("V")for i in PSEQ]
W=[i.count("W")for i in PSEQ]
X=[i.count("X")for i in PSEQ]
Y=[i.count("Y")for i in PSEQ]
Z=[i.count("Z")for i in PSEQ]

All={"A":A,"B":B,"C":C,"D":D,"E":E,"F":F,"G":G,"H":H,
     "I":I,"J":J,"K":K,"L":L,"M":M,"N":N,"O":O,"P":P,"Q":Q,
     "R":R,"S":S,"T":T,"U":U,"V":V,"W":W,"X":X,"Y":Y,"Z":Z}
#print(All)

import pandas as pd
df=pd.DataFrame(All)
print(df)

Here is my problem: the letter A occurs 7 times in my file, but the output shows it only 4 times. I want the count for A to be 7, matching the data in my file.

Upvotes: 0

Views: 377

Answers (2)

Konrad Rudolph

Reputation: 545776

Your code

  • contains a lot of redundancy (by repeating the counting for each letter) that you should automate away
  • creates lists of counts instead of counts, for each letter (is this intentional? I don’t think so).

Both can be fixed by using collections.Counter to count items (in this case, letters). Then the entire code reduces to:

from Bio import SeqIO
from collections import Counter
import pandas as pd

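filename = 'data.txt'  # the FASTA file from the question

frequencies = Counter()

for rec in SeqIO.parse(filename, 'fasta'):
    # Convert to str so the full sequence is counted letter by letter
    frequencies.update(str(rec.seq))

df = pd.DataFrame.from_dict(frequencies, orient='index')
print(df)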

This merges the counts for each sequence in the FASTA file. If you want to keep them separate, just maintain a dictionary/list of Counters, instead of a single Counter.
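A minimal sketch of the "dictionary of Counters" variant, assuming each record can be reduced to a (record ID, sequence string) pair. The helper name count_per_record and the sample sequences are made up for illustration; with Biopython the pairs could come from ((rec.id, str(rec.seq)) for rec in SeqIO.parse(filename, 'fasta')).

```python
from collections import Counter


def count_per_record(records):
    """Map each record ID to a Counter of the letters in its sequence.

    `records` is any iterable of (record_id, sequence_string) pairs.
    """
    return {rec_id: Counter(seq) for rec_id, seq in records}


# Made-up sample data, not taken from the question's file
sample = [("seq1", "MKVAAL"), ("seq2", "AAGT")]
per_record = count_per_record(sample)

for rec_id, counts in per_record.items():
    print(rec_id, dict(counts))
```

A Counter returns 0 for letters it has not seen, so per_record["seq2"]["M"] is safe to look up. If a table is wanted, pd.DataFrame.from_dict(per_record, orient='index').fillna(0) should give one row per sequence and one column per letter.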

Upvotes: 0

Tayyab Vohra

Reputation: 1672

Read the FASTA file with a small parser function and use count() on each sequence to count a letter.

def read_fasta(fp):
    """Yield (header, sequence) pairs from an open FASTA file."""
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header starts: emit the previous record, if any
            if name:
                yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    # Emit the last record
    if name:
        yield (name, ''.join(seq))

with open('protein.fasta') as fp:
    for name, seq in read_fasta(fp):
        print(seq.count("A"))
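To tabulate every letter rather than only A, the same count() call can be applied across a fixed alphabet. This is a rough sketch: the helper name letter_counts and the sample sequence are made up, and it assumes the sequences contain plain ASCII letters.

```python
import string


def letter_counts(seq):
    """Return {letter: count} for every uppercase letter A-Z in `seq`."""
    seq = seq.upper()  # treat lowercase residues the same as uppercase
    return {letter: seq.count(letter) for letter in string.ascii_uppercase}


counts = letter_counts("MKVAAL")  # made-up sequence for illustration
print(counts["A"], counts["Z"])
```

Unlike a Counter, this always yields all 26 keys, including letters that never occur, which keeps the columns aligned when several sequences are tabulated. In the loop over read_fasta(fp), print(letter_counts(seq)) could take the place of print(seq.count("A")).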

Upvotes: 1
