JoseMA

Reputation: 31

How to convert multiple FASTA lines into a matrix in Python?

I have a file (txt or fasta) like this. Each sequence sits on a single line.

    >Line1
    ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC
    >Line2
    ATTGCGCTANANAGCTANANCGATAGANCACGAAAGAGATAGACTATAGC
    >Line3
    ATCGCGCTANANAGCTANANGGCTAGANCNCGAAAGNGATAGACTATAGC
    >Line4
    ATTGCGCTANANAGCTANANGGATAGANCACGAGAGAGATAGACTATAGC
    >Line5
    ATTGCGCTANANAGCTANANCGATAGANCACGATNGAGATAGACTATAGC

I need to get a matrix in which each position corresponds to one letter (nucleotide) of the sequences, in this case a 5x50 matrix. I've been trying numpy methods. I hope someone can help me.

Upvotes: 3

Views: 1977

Answers (3)

Kevin

Reputation: 8207

If you are working with DNA sequence data in Python, I would recommend the Biopython library. You can install it with pip install biopython.

Here is how you would achieve your desired result:

from Bio import SeqIO
import os
import numpy as np

pathToFile = os.path.join("C:\\", "Users", "Kevin", "Desktop", "test.fasta")  # Windows machine

allSeqs = []
for seq_record in SeqIO.parse(pathToFile, "fasta"):
    # Split each Seq into single letters so numpy builds a 2D character array
    allSeqs.append(list(seq_record.seq))

seqMat = np.array(allSeqs)

Inside the for loop, each seq_record.seq is a Seq object, which gives you the flexibility to perform operations on it.

In [5]: seqMat.shape
Out[5]: (5L, 50L)

You can slice your seqMat array however you like.

In [6]: seqMat[0]
Out[6]: array(['A', 'T', 'C', 'G', 'C', 'G', 'C', 'T', 'A', 'N', 'A', 'N', 'A',
       'G', 'C', 'T', 'A', 'N', 'A', 'N', 'A', 'G', 'C', 'T', 'A', 'G',
       'A', 'N', 'C', 'A', 'C', 'G', 'A', 'T', 'A', 'G', 'A', 'G', 'A',
       'G', 'A', 'G', 'A', 'C', 'T', 'A', 'T', 'A', 'G', 'C'], 
      dtype='|S1')
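To illustrate what slicing buys you, here is a minimal sketch that works on columns rather than rows. It assumes the five sequences from the question are typed in directly instead of being read with Biopython, and the names third_column, n_per_column and variable are only illustrative.

```python
import numpy as np

# The five sequences from the question, typed in directly for illustration
seqs = [
    "ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC",
    "ATTGCGCTANANAGCTANANCGATAGANCACGAAAGAGATAGACTATAGC",
    "ATCGCGCTANANAGCTANANGGCTAGANCNCGAAAGNGATAGACTATAGC",
    "ATTGCGCTANANAGCTANANGGATAGANCACGAGAGAGATAGACTATAGC",
    "ATTGCGCTANANAGCTANANCGATAGANCACGATNGAGATAGACTATAGC",
]
seqMat = np.array([list(s) for s in seqs])

# One letter per sequence at position 2 (zero-based)
third_column = seqMat[:, 2]
# Number of N calls at each position, counted down the rows
n_per_column = (seqMat == "N").sum(axis=0)
# True at every position where at least one sequence differs from the first
variable = (seqMat != seqMat[0]).any(axis=0)

print(seqMat.shape)
print(third_column)
print(np.flatnonzero(variable))  # indices of the variable positions
```

Because the array is 2D, comparisons such as seqMat == "N" are applied to every letter at once, so per-position statistics need no explicit loops.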

I highly recommend checking out the Biopython tutorial, though!

Upvotes: 1

Arijit

Reputation: 185

One way to build the matrix is to read the content of the file and convert it into a list in which each element is the sequence from one line. You can then access the list as a 2D data structure. Ex: [ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC]

filePath = "file path containing the sequence"

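A list that stores the sequences as a matrix:

# Keep only the sequence lines: skip ">" headers and blank lines
with open(filePath) as handle:
    listFasta = [line.strip() for line in handle
                 if line.strip() and not line.startswith(">")]
for seq in listFasta:
    for charac in seq:
        print(charac)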

Another way to access each element of your matrix, this time by index:

for seq in range(len(listFasta)):
    for ch in range(len(listFasta[seq])):
        print(listFasta[seq][ch])
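If you want an actual numpy matrix out of this approach, the steps above can be sketched as follows. This is only a sketch under some assumptions: fasta_to_matrix is a hypothetical helper name, the demo data is a small in-memory stand-in for the real file, and all sequences are assumed to have the same length (otherwise an error is raised).

```python
import io
import numpy as np

def fasta_to_matrix(handle):
    """Return (names, matrix) from an open FASTA-formatted text handle."""
    names, seqs = [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            names.append(line[1:])  # header text without the ">"
            seqs.append("")
        elif seqs:
            # Append to the current record; this also joins sequences
            # that are wrapped over several lines
            seqs[-1] += line
    lengths = {len(s) for s in seqs}
    if len(lengths) > 1:
        raise ValueError("sequences differ in length: %s" % sorted(lengths))
    # One row per sequence, one column per letter
    return names, np.array([list(s) for s in seqs])

# Small in-memory stand-in for a file opened with open(filePath)
demo = io.StringIO(">Line1\nATCG\n>Line2\nAT\nTG\n")
names, matrix = fasta_to_matrix(demo)
print(names)
print(matrix.shape)
```

Keeping the header names alongside the matrix lets you map each row back to its sequence later.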

Upvotes: 0

Philipp Braun

Reputation: 1573

I hope this short bit of code helps. You basically need to split the string into a character array. After that you just put everything into a matrix.

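import numpy as np

Line1 = "ATGC"
Line2 = "GCTA"
# Both rows go inside one outer list; a second positional argument would be read as the dtype
Matr1 = np.array([list(Line1), list(Line2)])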

Matr1[0,0] will return the first element in your matrix.
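For long or numerous sequences, splitting every string into a Python list can get slow. As a possible alternative (my own suggestion, not part of the approach above), here is a sketch that views the joined ASCII bytes directly as a 2D array of single characters. It assumes all lines have the same length and contain only ASCII letters; the variable names are illustrative.

```python
import numpy as np

lines = ["ATGC", "GCTA"]
n_rows, n_cols = len(lines), len(lines[0])

# Join all rows into one ASCII byte string, then view it as 1-byte items
raw = "".join(lines).encode("ascii")
mat = np.frombuffer(raw, dtype="S1").reshape(n_rows, n_cols)

print(mat.shape)
print(mat[0, 0])  # elements are bytes objects, so compare against b"A"
```

Note that the resulting array is read-only because it shares memory with the bytes object; call mat.copy() if you need to modify it.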

Upvotes: 0
