How to write specific line lengths of a file?

Question

I have sequences (over 9000) like these:

>TsM_000224500 
MTTKWPQTTVTVATLSWGMLRLSMPKVQTTYKVTQSRGPLLAPGICDSWSRCLVLRVYVDRRRPGGDGSLGRVAVTVVETGCFGSAASFSMWVFGLAFVVTIEEQLL
>TsM_000534500 
MHSHIVTVFVALLLTTAVVYAHIGMHGEGCTTLQCQRHAFMMKEREKLNEMQLELMEMLMDIQTMNEQEAYYAGLHGAGMQQPLPMPIQ
>TsM_000355900 
MESGEENEYPMSCNIEEEEDIKFEPENGKVAEHESGEKKESIFVKHDDAKWVGIGFAIGTAVAPAVLSGISSAAVQGIRQPIQAGRNNGETTEDLENLINSVEDDL

The lines containing ">" are the IDs, and the lines with the letters are the amino acid (aa) sequences. I need to delete (or move to another file) the sequences shorter than 40 aa or longer than 4000 aa, so that the resulting file contains only the sequences within this range (>= 40 aa and <= 4000 aa).

I've tried writing the following script:

def read_seq(file_name):
    with open(file_name) as file:
        return file.read().split('\n')[0:]

ts = read_seq("/home/tiago/t_solium/ts_phtm0less.txt")

tsf = open("/home/tiago/t_solium/ts_secp-404k", 'w')

for x in range(len(ts)):
    if ([x][0:1] != '>'):
        if (len([x]) > 40 or len([x]) < 4000):

            tsf.write('%s\n'%(x))

tsf.close()

print "OK!"

I've made some modifications, but all I get is either an empty file or a file with all 9000+ sequences.

johmsp · Accepted Answer

In your for loop, x is an integer index (0, 1, 2, 3, 4...) because you are iterating over range(), not over the lines themselves. Try this instead:

for x in ts:

This gives you each element of ts (that is, each line of the file) as x.

Also, you don't need the brackets around x; Python can index and slice the characters of a string directly. Putting brackets around a string wraps it in a one-element list, so len([x]) is always 1 and [x][0:1] is a list that never equals '>'. Likewise, if you tried to get the second character of x with [x][1], Python would look for the second element of that one-element list and raise an IndexError.
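To make the difference concrete, here is a small sketch. The two-line ts list is made up for illustration and is not taken from your file:

```python
# Illustrative data only: one ID line and one short sequence line.
ts = [">TsM_example", "MTTKWPQ"]

# range(len(ts)) yields integer indexes, not the lines themselves.
indexes = [x for x in range(len(ts))]

# Iterating over ts directly yields each line as a string.
lines = [x for x in ts]

x = lines[1]
wrapped_length = len([x])   # length of a one-element list, always 1
real_length = len(x)        # number of characters in the sequence
first_char = lines[0][0:1]  # slicing the string itself gives '>'
wrapped_slice = [x][0:1]    # slicing the list gives a list, never '>'
```

Since wrapped_length is always 1, a test such as len([x]) > 40 can never tell short sequences from long ones.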

EDIT: To include IDs, try this:

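NOTE: I also changed if (len(x) > 40 or len(x) < 4000) to if (len(x) >= 40 and len(x) <= 4000). With or, the condition is true for every possible length, so nothing is filtered out; using and gives you the result you're looking for. I used >= and <= so that sequences of exactly 40 or 4000 aa are kept, as described in the question.

for i, x in enumerate(ts): #NEW: enumerate ts to get the index of every iteration (stored as i)
    if not x.startswith('>'): # sequence line; unlike x[0], this does not fail on an empty line
        if (len(x) >= 40 and len(x) <= 4000):
            tsf.write('%s\n'%(ts[i-1])) #NEW: write the ID found on the preceding line
            tsf.write('%s\n'%(x))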
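Putting the pieces together, a self-contained version might look like the sketch below. It assumes every ID line is followed by exactly one sequence line, as in your example, and uses the inclusive range from the question (>= 40 and <= 4000). The names filter_by_length and filter_file are hypothetical helpers introduced here for illustration.

```python
def filter_by_length(lines, min_len=40, max_len=4000):
    """Return the ID and sequence lines whose sequence length is in range.

    Assumes each '>' ID line is followed by exactly one sequence line.
    """
    kept = []
    current_id = None
    for line in lines:
        line = line.rstrip('\r\n')  # drop the line ending, keep everything else
        if not line.strip():
            continue  # skip blank lines, e.g. after a trailing newline
        if line.startswith('>'):
            current_id = line  # remember the ID for the next sequence line
        elif current_id is not None and min_len <= len(line) <= max_len:
            kept.append(current_id)
            kept.append(line)
    return kept


def filter_file(in_path, out_path, min_len=40, max_len=4000):
    """Read in_path, keep in-range records, and write them to out_path."""
    with open(in_path) as src:
        kept = filter_by_length(src, min_len, max_len)
    with open(out_path, 'w') as out:
        for line in kept:
            out.write(line + '\n')
    return len(kept) // 2  # number of records written
```

Calling filter_file with your input and output paths would write only the in-range records. Tracking the most recent ID in current_id avoids the ts[i-1] lookup, and the with blocks close both files even if an error occurs.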
