Reputation: 1547
I am trying to read DNA sequences into a pandas DataFrame, but the whole sequence does not end up in the DataFrame column.
I have tried a plain file open and a simple read_csv call, but neither method helped much.
pd.read_csv('../input/data 1/non-cpp.txt', index_col=0, header=None)
Output:
0
>
GNNRPVYIPQPRPPHPRI
>
HGVSGHGQHGVHG
>
myfile = open("../input/data 1/non-cpp.txt")
for line in myfile:
    print(line)
myfile.close()
>
GNNRPVYIPQPRPPHPRI
>
HGVSGHGQHGVHG
>
QRFSQPTFKLPQGRLTLSRKF
>
FLPVLAGIAAKVVPALFCKITKKC
I need the labels in one column (what you can see in the 1st row) and the whole sequence in a second column (what you can see in the 2nd row), e.g.:
Label    Sequence
Upvotes: 0
Views: 710
Reputation: 366
This is rough and not a one-liner, but it gives you what you need: a Series containing the DNA sequences.
import pandas as pd
data = pd.read_csv('cpp.txt', sep=">",header=None)
data[0].dropna()
I hope it helps
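To see why the `sep=">"` trick works on a file shaped like the one in the question, here is a minimal sketch on an in-memory sample. The `sample` string is made up from the first two records shown in the question, and `io.StringIO` stands in for the real file; both are assumptions for illustration. It relies on the first line being a bare `>`, so pandas infers two columns and the sequence lines land in column 0.

```python
import io

import pandas as pd

# Hypothetical sample mirroring the question's file: a bare ">" line,
# then the sequence on the following line.
sample = ">\nGNNRPVYIPQPRPPHPRI\n>\nHGVSGHGQHGVHG\n"

# Splitting on ">" turns each ">" line into two empty fields (read as NaN),
# while each sequence line fills column 0 only.
data = pd.read_csv(io.StringIO(sample), sep=">", header=None)

# Drop the NaN rows left by the ">" lines and renumber the index from 0.
sequences = data[0].dropna().reset_index(drop=True)
print(sequences)
```

If the `>` lines carried a label after the marker, that text would end up in column 1 on a separate row from its sequence, so this approach only recovers the sequences, not label/sequence pairs.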
Upvotes: 1
Reputation: 8913
Let's say your file is something like:
>a1|b1|c1
a111
>a2|b2|c2
a222
>a3|b3|c3
a333
Note that here we have 6 lines.
Then, you can read the file, and store the data:
import pandas as pd
with open('filename.txt', 'r') as f:
    content = f.readlines()
n = len(content)
# even-indexed lines are the labels, odd-indexed lines are the sequences
label = [content[i].strip() for i in range(0, n, 2)]
seq = [content[i].strip() for i in range(1, n, 2)]
df = pd.DataFrame({'label': label,
                   'sequence': seq})
and you get a pandas dataframe:
       label sequence
0  >a1|b1|c1     a111
1  >a2|b2|c2     a222
2  >a3|b3|c3     a333
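The even/odd split above assumes every record takes exactly two lines. If a sequence can wrap over several lines or the file contains blank lines, a small line-by-line parser is one alternative. The following is only a sketch: `parse_fasta` and the `sample` list are hypothetical names introduced here, and unlike the output above it strips the leading `>` from each label.

```python
def parse_fasta(lines):
    """Collect (label, sequence) pairs from FASTA-style lines."""
    records = []
    label, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:], []
        elif label is not None:
            # Sequence text may span several lines; join them later.
            chunks.append(line)
    # Flush the last record once the input is exhausted.
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


# Hypothetical input: the first record wraps over two lines.
sample = [">a1|b1|c1\n", "a111\n", "a1xx\n", "\n", ">a2|b2|c2\n", "a222\n"]
records = parse_fasta(sample)
print(records)
```

Passing an open file object instead of `sample` works the same way, and `pd.DataFrame(records, columns=['label', 'sequence'])` then gives the two-column frame. For a file with bare `>` lines, as in the question, each label is simply an empty string.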
Upvotes: 0