hari
hari

Reputation: 61

Parsing the result using python

I am a beginner in Python (I am a biologist) and I have a file with the results from a particular software and i would like to parse the result using python. From the following output I would like to get just the score and would like to split the sequence into individual amino acids.

no. score Sequence

1   0.273778    FFHH-YYFLHRRRKKCCNNN-CCCK---HQQ---HHKKHV-FGGGE-EDDEDEEEEEEEE-EE--
2   0.394647    IIVVIVVVVIVVVVVVVVVV-CCCVA-IVVI--LIIIIIIIIYYYA-AVVVVVVVAAAAV-AST-
3   0.456667        FIVVIVVVVIXXXXIGGGGT-CCCCAV -------------IVBBB-AAAAAA--------AAAA-  
4   0.407581    MMLMILLLLMVVAIILLIII-LLLIVLLAVVVVVAAAVAAVAIIII-ILIIIIIILVIMKKMLA-
5   0.331761    AANSRQSNAAQRRQCSNNNR-RALERGGMFFRRKQNNQKQKKHHHY-FYFYYSNNWWFFFFFFR-
6   0.452381    EEEEDEEEEEEEEEEEEEEE-EEEEESSTSTTTAEEEEEEEEEEEE-EEEEEEEEEEEEEEEEE-
7   0.460385    LLLLLLLLMMIIILLLIIII-IIILLVILMMEEFLLLLILIVLLLM-LLLLLLLLLLVILLLVL-
8   0.438680    ILILLVVVVILVVVLQLLMM-QKQLIVVLLVIIMLLLLMLLSIIIS-SMMMILFFLLILIIVVL-
9   0.393291    QQQDEEEQAAEEEDEKGSSD-QQEQDDQDEEAAAHQLESSATVVQR-QQQQQVVYTHSTVTTTE-

From the above table,I would like to get a table with the same number,score but the sequences separated individually (columnwise) so it should look like

no.      score         amino acid(1st column)

1      0.273778         F

2      0.395657         I

3      0.456667         F

another table representing the second column of amino acids

no       score       amino acid (2nd column)

1       0.273778         F

2       0.395657         I

3       0.456667         I  

third table representing the third column of amino acids and fourth table for 4th column of amino acids and so on

Thanks in advance for the help

Upvotes: 1

Views: 525

Answers (4)

Abdurahman
Abdurahman

Reputation: 658

Here is a simple working solution:

#opening file: "db.txt" full path to file if it is in the same directory as python file 
#you can use any extension for the file ,'r' for reading mode
filehandler=open("db.txt",'r') 
#Saving all the lines once in a list every line is a list member
#Another way: you can read it line by line
LinesList=filehandler.readlines()
#creating an empty multi dimension list to store your results
no=[]
Score=[]
AminoAcids=[] # this is a multi-dimensional list for example index 0 has a list of char. of first line and so on
#process each line assuming constant spacing in the input file
#no is the first char. score from char 4 to 12 and Amino from 16 to end
for Line in LinesList:
    #add the no
    no.append(Line[0])
    #add the score
    Score.append(Line[4:12])
    Aminolist=list(Line[16:]) #breaking the amino acid as each character is a list element
    #add Aminolist to the AminoAcids Matrix (multi-dimensional array)
    AminoAcids.append(Aminolist)

#you can now play with the data!
#printing Tables ,you can also write them into a file instead
for k in range(0,65):
    print"Table %d" %(k+1) # adding 1 to not be zero indexed
    print"no. Score      amino acid(column %d)" %(k+1)
    for i in range(len(no)):
        print "%s   %s   %s" %(no[i],Score[i],AminoAcids[i][k])

Here is a part of result appears on the console:

Table 1
no. Score      amino acid(column 1)
1   0.273778   F
2   0.394647   I
3   0.456667   F
4   0.407581   M
5   0.331761   A
6   0.452381   E
7   0.460385   L
8   0.438680   I
9   0.393291   Q
Table 2
no. Score      amino acid(column 2)
1   0.273778   F
2   0.394647   I
3   0.456667   I
4   0.407581   M
5   0.331761   A
6   0.452381   E
7   0.460385   L
8   0.438680   L
9   0.393291   Q
Table 3
no. Score      amino acid(column 3)
1   0.273778   H
2   0.394647   V
3   0.456667   V
4   0.407581   L
5   0.331761   N
6   0.452381   E
7   0.460385   L
8   0.438680   I
9   0.393291   Q
Table 4
no. Score      amino acid(column 4)
1   0.273778   H
2   0.394647   V
3   0.456667   V
4   0.407581   M
5   0.331761   S
6   0.452381   E
7   0.460385   L
8   0.438680   L
9   0.393291   D
Table 5
no. Score      amino acid(column 5)
1   0.273778   -
2   0.394647   I
3   0.456667   I
4   0.407581   I
5   0.331761   R
6   0.452381   D
7   0.460385   L
8   0.438680   L
9   0.393291   E
Table 6
no. Score      amino acid(column 6)
1   0.273778   Y
2   0.394647   V
3   0.456667   V
4   0.407581   L
5   0.331761   Q
6   0.452381   E
7   0.460385   L
8   0.438680   V
9   0.393291   E
Table 7
no. Score      amino acid(column 7)
1   0.273778   Y
2   0.394647   V
3   0.456667   V
4   0.407581   L
5   0.331761   S
6   0.452381   E
7   0.460385   L
8   0.438680   V
9   0.393291   E

Upvotes: 0

Ski
Ski

Reputation: 14487

From your example I guess that:

  • you want to save each table to a different results file.
  • each sequence is 65 characters long
  • some sequences contains meaningless white-spaces which has to be removed (line 3 in your example)

Here is my code sample, it reads data from input.dat and writes results to result-column-<number>.dat:

import re
import sys

# I will write each table to different results-file.
# dictionary to map columns (numbers) to opened file objects:
resultfiles = {}


def get_result_file(column):
    # helper to easily access results file.
    if column not in resultfiles:
        resultfiles[column] = open('result-column-%d.dat' % column, 'w')
    return resultfiles[column]


# iterate over data:
for line in open('input.dat'):
    try:
        # str.split(separator, maxsplit)
        # with `maxsplit`=2 it is more fail-proof:
        no, score, seq = line.split(None, 2)

        # from your example I guess that white-spaces in sequence are meaningless,
        # however in your example one sequence contains white-space, so I remove it:
        seq = re.sub('\s+', '', seq)

        # data validation will help to spot problems early:
        assert int(no), no          
        assert float(score), score
        assert len(seq) == 65, seq

    except Exception, e:
        # print the error and continue to process data:
        print >> sys.stderr, 'Error %s in line: %s.' % (e, line)
        continue  # jump to next iteration of for loop.

    # int(), float() will rise ValueError if no or score aren't numbers
    # assert <condition> will rise AssertionError if condition is False.

    # iterate over each character in amino sequance:
    for column, char in enumerate(seq, 1):
        f = get_result_file(column)
        f.write('%s    %s    %s\n' % (no, score, char))


# close all opened result files:
for f in resultfiles.values():
    f.close()

Notable functions used in this example:

Upvotes: 0

eyquem
eyquem

Reputation: 27575

I don't think it's useful to create tables.
Just put the data in an adapted structure and use a function that displays what you need at the moment you need:

with open('bio.txt') as f:
    data = [line.rstrip().split(None,2) for line in f if line.strip()]


def display(data,nth,pat='%-6s  %-15s  %s',uz=('th','st','nd','rd')):
    print pat % ('no.','score',
                 'amino acid(%d%s column)' %(nth,uz[0 if nth//4 else nth]))
    print '\n'.join(pat % (a,b,c[nth-1]) for a,b,c in data)    

display(data,1)
print
display(data,3)
print
display(data,7)

result

no.     score            amino acid(1st column)
1       0.273778         F
2       0.394647         I
3       0.456667         F
4       0.407581         M
5       0.331761         A
6       0.452381         E
7       0.460385         L
8       0.438680         I
9       0.393291         Q

no.     score            amino acid(3rd column)
1       0.273778         H
2       0.394647         V
3       0.456667         V
4       0.407581         L
5       0.331761         N
6       0.452381         E
7       0.460385         L
8       0.438680         I
9       0.393291         Q

no.     score            amino acid(7th column)
1       0.273778         Y
2       0.394647         V
3       0.456667         V
4       0.407581         L
5       0.331761         S
6       0.452381         E
7       0.460385         L
8       0.438680         V
9       0.393291         E

Upvotes: 0

Fred Foo
Fred Foo

Reputation: 363547

Assuming you've opened the file containing this data as f, then your example can be reproduced with:

for ln in f:    # loop over all lines
    seqno, score, seq = ln.split()
    print("%s    %s    %s" % (seqno, score, seq[0]))

To split out the sequence, you need to additionally loop over the letters in seq:

for ln in f:
    seqno, score, seq = ln.split()
    for x in seq:
        print("%s    %s    %s" % (seqno, score, seq[0]))

This will print the sequence number and score lots of times. I'm not sure if that's what you want.

Upvotes: 5

Related Questions