Sohaib Anwaar

Reputation: 1547

Read a text file of protein sequences in Python

I am trying to read DNA sequences into a pandas DataFrame, but I am not getting the whole sequence in the DataFrame column.

I have tried both the built-in open() function and a plain read_csv call, but neither gave me what I need.

pd.read_csv('../input/data 1/non-cpp.txt', index_col=0, header=None)

Output:

0
>
GNNRPVYIPQPRPPHPRI
>
HGVSGHGQHGVHG
>

with open("../input/data 1/non-cpp.txt") as myfile:
    for line in myfile:
        print(line)

>

GNNRPVYIPQPRPPHPRI

>

HGVSGHGQHGVHG

>

QRFSQPTFKLPQGRLTLSRKF

>

FLPVLAGIAAKVVPALFCKITKKC

DataSet Source

Here are some of the sequences I want to read.

I need the labels (the first line of each record) in one column and the whole sequence (the second line of each record) in a second column, e.g.:

Label    Sequence

Upvotes: 0

Views: 710

Answers (2)

Constanza Garcia

Reputation: 366

This is rough and not quite a one-liner, but it will give you what you need: a Series with the DNA sequences.

import pandas as pd

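# Split on ">" so that the marker-only lines become empty (NaN) rows
data = pd.read_csv('cpp.txt', sep=">", header=None)

# Drop the empty marker rows, keeping only the sequences
sequences = data[0].dropna()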

I hope it helps
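As a minimal sketch of how this behaves, assuming a file laid out like the one in the question (a bare ">" line followed by one sequence line), the same call can be tried on a small in-memory sample. The sample string and the `reset_index` step are additions for illustration, not part of the original data or answer.

```python
import io

import pandas as pd

# Hypothetical in-memory sample mimicking the layout shown in the question
sample = io.StringIO(">\nGNNRPVYIPQPRPPHPRI\n>\nHGVSGHGQHGVHG\n")

# Splitting on ">" leaves the marker-only lines empty, which pandas reads as NaN
data = pd.read_csv(sample, sep=">", header=None)

# Keep only the non-empty rows and renumber them from 0
sequences = data[0].dropna().reset_index(drop=True)
print(sequences.tolist())
```

Note that this only recovers the sequences, not the labels, and that with the default settings a sequence that is literally "NA" would also be treated as missing.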

Upvotes: 1

sentence

Reputation: 8913

Let's say your file is something like:

>a1|b1|c1
a111
>a2|b2|c2
a222
>a3|b3|c3
a333

Note that here we have 6 lines.

Then, you can read the file, and store the data:

import pandas as pd

with open('filename.txt', 'r') as f:
    content = f.readlines()

n = len(content)

# Even-indexed lines hold the labels, odd-indexed lines hold the sequences
label = [content[i].strip() for i in range(0, n, 2)]
seq = [content[i].strip() for i in range(1, n, 2)]

df = pd.DataFrame({'label': label,
                   'sequence': seq})

and you get a pandas DataFrame:

       label sequence
0  >a1|b1|c1     a111
1  >a2|b2|c2     a222
2  >a3|b3|c3     a333
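The approach above relies on labels and sequences alternating strictly, one line each. If the file might contain blank lines or sequences wrapped over several lines, a small line-by-line parser is one possible alternative. The following is only a sketch under that assumption; the function name `parse_fasta_like` and the sample text are made up for illustration.

```python
import io


def parse_fasta_like(handle):
    """Yield (label, sequence) pairs from a FASTA-like text stream."""
    label = None
    parts = []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new label starts: emit the record collected so far
            if label is not None:
                yield label, "".join(parts)
            label = line
            parts = []
        elif label is not None:
            # Sequence lines are joined, so wrapped sequences stay whole
            parts.append(line)
    # Emit the last record once the input is exhausted
    if label is not None:
        yield label, "".join(parts)


# Hypothetical sample with a blank line and a sequence split over two lines
sample = io.StringIO(">a1|b1|c1\na111\n\n>a2|b2|c2\na2\n22\n>a3|b3|c3\na333\n")
records = list(parse_fasta_like(sample))
print(records)
```

The resulting list of pairs can then be passed to pandas, for example `pd.DataFrame(records, columns=['label', 'sequence'])`, to get the same two-column layout.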

Upvotes: 0
