Reputation: 31
I have a file (txt or fasta) like this. Each sequence sits on a single line.
>Line1
ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC
>Line2
ATTGCGCTANANAGCTANANCGATAGANCACGAAAGAGATAGACTATAGC
>Line3
ATCGCGCTANANAGCTANANGGCTAGANCNCGAAAGNGATAGACTATAGC
>Line4
ATTGCGCTANANAGCTANANGGATAGANCACGAGAGAGATAGACTATAGC
>Line5
ATTGCGCTANANAGCTANANCGATAGANCACGATNGAGATAGACTATAGC
I need to get a matrix in which each position corresponds to one letter (nucleotide) of the sequences. In this case that is a 5x50 matrix. I've been trying to do this with numpy methods. I hope someone can help me.
Upvotes: 3
Views: 1977
Reputation: 8207
If you are working with DNA sequence data in Python, I would recommend using the Biopython library. You can install it with pip install biopython.
Here is how you would achieve your desired result:
from Bio import SeqIO
import os
import numpy as np
pathToFile = os.path.join("C:\\","Users","Kevin","Desktop","test.fasta") #windows machine
allSeqs = []
for seq_record in SeqIO.parse(pathToFile, "fasta"):
    # split each sequence into single characters so every base gets its own column
    allSeqs.append(list(str(seq_record.seq)))
seqMat = np.array(allSeqs)
Note that in the for loop, each seq_record.seq is a Seq object, which gives you the flexibility to perform operations on it.
In [5]: seqMat.shape
Out[5]: (5L, 50L)
You can slice your seqMat array however you like.
In [6]: seqMat[0]
Out[6]: array(['A', 'T', 'C', 'G', 'C', 'G', 'C', 'T', 'A', 'N', 'A', 'N', 'A',
'G', 'C', 'T', 'A', 'N', 'A', 'N', 'A', 'G', 'C', 'T', 'A', 'G',
'A', 'N', 'C', 'A', 'C', 'G', 'A', 'T', 'A', 'G', 'A', 'G', 'A',
'G', 'A', 'G', 'A', 'C', 'T', 'A', 'T', 'A', 'G', 'C'],
dtype='|S1')
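To illustrate the slicing mentioned above, here is a minimal sketch. It uses a small hypothetical stand-in for seqMat (three made-up sequences of length 5, not the data from the question), and the variable names are only for illustration.

```python
import numpy as np

# Hypothetical stand-in for seqMat: three short sequences of equal length.
seqs = ["ATCGN", "ATTGC", "ANCGC"]
seq_mat = np.array([list(s) for s in seqs])

first_row = seq_mat[0]       # every base of the first sequence
third_col = seq_mat[:, 2]    # the base at position 3 of every sequence
block = seq_mat[1:, :2]      # last two sequences, first two positions

# Elementwise comparison gives a boolean matrix; summing along axis 1
# counts the N calls in each sequence.
n_per_seq = (seq_mat == "N").sum(axis=1)

print(seq_mat.shape)
print(third_col)
print(n_per_seq)
```

Column slices like seq_mat[:, 2] are what make the 2D array more convenient than a plain list of strings when you want to compare the same position across sequences.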
Highly recommend checking out the tutorial though!
Upvotes: 1
Reputation: 185
One way to build the matrix is to read the contents of the file and convert them into a list in which each element is the sequence on one line. You can then access the matrix as a 2D data structure. Ex: [ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC]
filePath = "file path containing the sequence"
# List that stores the sequences as a matrix; header lines starting with ">" are skipped
with open(filePath) as handle:
    listFasta = [line.strip() for line in handle if line.strip() and not line.startswith(">")]
for seq in listFasta:
    for charac in seq:
        print(charac)
# Another way to access each element of your matrix
for seq in range(len(listFasta)):
    for ch in range(len(listFasta[seq])):
        print(listFasta[seq][ch])
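As a self-contained sketch of the same idea, the snippet below parses FASTA-style text into a list of character lists and checks that all rows have the same length. The helper name read_fasta_rows and the two short records are hypothetical, and it assumes each sequence sits on a single line, as stated in the question.

```python
def read_fasta_rows(lines):
    """Return one list of characters per sequence, skipping '>' header lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # ignore blank lines and FASTA headers
        rows.append(list(line))
    return rows


# Hypothetical two-record input in the same layout as the question's file.
fasta_text = """>Line1
ATCG
>Line2
ATTG
"""

rows = read_fasta_rows(fasta_text.splitlines())

# A proper matrix needs every row to have the same number of columns.
lengths = {len(row) for row in rows}
is_rectangular = len(lengths) == 1

print(len(rows), is_rectangular, rows[1][2])
```

The same function accepts an open file object in place of fasta_text.splitlines(), since it only iterates over lines.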
Upvotes: 0
Reputation: 1573
I hope this short bit of code helps. You basically need to split the string into a character array. After that you just put everything into a matrix.
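import numpy as np
Line1 = "ATGC"
Line2 = "GCTA"
# Both rows go inside one outer list; a second positional argument would be read as the dtype.
# np.array is used here instead of np.matrix, which NumPy no longer recommends.
Matr1 = np.array([list(Line1), list(Line2)])
Matr1[0,0] will return the first element in your matrix.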
Upvotes: 0