Tiago Minuzzi

Reputation: 141

How to write only the lines of specific lengths from a file?

I have sequences (over 9000) like this:

>TsM_000224500 
MTTKWPQTTVTVATLSWGMLRLSMPKVQTTYKVTQSRGPLLAPGICDSWSRCLVLRVYVDRRRPGGDGSLGRVAVTVVETGCFGSAASFSMWVFGLAFVVTIEEQLL
>TsM_000534500 
MHSHIVTVFVALLLTTAVVYAHIGMHGEGCTTLQCQRHAFMMKEREKLNEMQLELMEMLMDIQTMNEQEAYYAGLHGAGMQQPLPMPIQ
>TsM_000355900 
MESGEENEYPMSCNIEEEEDIKFEPENGKVAEHESGEKKESIFVKHDDAKWVGIGFAIGTAVAPAVLSGISSAAVQGIRQPIQAGRNNGETTEDLENLINSVEDDL

The lines starting with ">" are the IDs, and the lines of letters are the amino acid (aa) sequences. I need to delete (or move to another file) the sequences shorter than 40 aa or longer than 4000 aa, so that the resulting file contains only the sequences within this range (>= 40 aa and <= 4000 aa).

I've tried writing the following script:

def read_seq(file_name):
    with open(file_name) as file:
        return file.read().split('\n')[0:]

ts = read_seq("/home/tiago/t_solium/ts_phtm0less.txt")

tsf = open("/home/tiago/t_solium/ts_secp-404k", 'w')

for x in range(len(ts)):
    if ([x][0:1] != '>'):
        if (len([x]) > 40 or len([x]) < 4000):

            tsf.write('%s\n'%(x))

tsf.close()

print "OK!"

I've made some modifications, but all I get is either an empty file or a file with all 9000+ sequences.

Upvotes: 0

Views: 523

Answers (2)

johmsp

Reputation: 294

In your for loop, x is an integer (0, 1, 2, 3, 4, ...) because you are iterating over range(). Try this instead:

for x in ts:

This gives you each element of ts as x.

Also, you don't need the brackets around x; Python can index and slice strings directly. Putting brackets around a string wraps it in a one-element list, so [x][0:1] returns a list containing the whole string rather than its first character, len([x]) is always 1, and [x][1] raises an IndexError because the list has no second element.
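To make the difference concrete, here is a small sketch. The `lines` list and its contents are made up for illustration and are not taken from the real input file.

```python
# Illustrative data only: one ID line and one short sequence line.
lines = [">TsM_example", "MTTKWPQTTV"]

# range(len(...)) yields integer positions, not the lines themselves.
positions = [x for x in range(len(lines))]

# Iterating over the list directly yields each line as a string.
items = [x for x in lines]

first = lines[0]

# Wrapping a string in brackets builds a one-element list, so slicing it
# returns a list (never equal to '>') and its length is always 1.
wrapped_slice = [first][0:1]
wrapped_len = len([first])

# Without brackets, slicing and len() operate on the string's characters.
plain_slice = first[0:1]
plain_len = len(first)

print("positions: %r" % (positions,))
print("items: %r" % (items,))
print("wrapped: %r, length %d" % (wrapped_slice, wrapped_len))
print("plain: %r, length %d" % (plain_slice, plain_len))
```

The wrapped versions describe the list, while the plain versions describe the line itself, which is what the length filter needs.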

EDIT: To include IDs, try this:

NOTE: I also changed if (len(x) > 40 or len(x) < 4000) to if (len(x) >= 40 and len(x) <= 4000). With or, the condition is true for every length, so nothing is filtered out; using and gives you the result you're looking for, and >= / <= keep sequences of exactly 40 or 4000 aa, as the question asks.

for i, x in enumerate(ts): #NEW: enumerate ts to get the index of every iteration (stored as i)
    if x and x[0] != '>': # skip empty lines (e.g. after the final newline) and ID lines
        if (len(x) >= 40 and len(x) <= 4000):
            tsf.write('%s\n'%(ts[i-1])) #NEW: write the ID found on the preceding line
            tsf.write('%s\n'%(x))
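The effect of or versus and on a length check can be seen with a few made-up lengths. This is only a sketch: the function names and the `lengths` list are hypothetical, and the inclusive bounds follow the range stated in the question.

```python
def in_range_or(n):
    # True for every n: any number is greater than 40, less than 4000,
    # or both, so this check filters nothing out.
    return n > 40 or n < 4000

def in_range_and(n):
    # Both bounds must hold; a chained comparison expresses this directly.
    return 40 <= n <= 4000

lengths = [10, 39, 40, 4000, 4001, 9000]  # hypothetical sequence lengths

kept_or = [n for n in lengths if in_range_or(n)]
kept_and = [n for n in lengths if in_range_and(n)]

print("or keeps:  %r" % (kept_or,))
print("and keeps: %r" % (kept_and,))
```

The or version keeps all six lengths, which matches the symptom of getting every sequence back in the output file.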

Upvotes: 1

Wajahat

Reputation: 1633

Try this; it is simple and easy to understand. It does not load the entire file into memory; instead, it iterates over the file line by line.

tsf=open('output.txt','w') # open the output file
with open("yourfile",'r') as ts: # open the input file
    for line in ts: # iterate over each line of input file
        line=line.strip() # removes all whitespace at the start and end, including spaces, tabs, newlines and carriage returns.
        if not line or line[0]=='>': # if line is empty or an ID
            continue # move to the next line
        else: # otherwise
            if (len(line)>=40) and (len(line)<=4000): # if line is in required length
                tsf.write('%s\n'%line) # write to output file

tsf.close() # done
print "OK!"
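If the IDs should stay with their sequences, and the out-of-range records should go to a second file as the question mentions, the same line-by-line idea could be extended roughly as follows. This is a sketch under assumptions: the function name `split_by_length` and the sample records are made up, each ID line is assumed to be followed by exactly one sequence line, and it is written for Python 3.

```python
import io

def split_by_length(src, kept, rejected, min_len=40, max_len=4000):
    """Copy ID/sequence pairs from src to kept or rejected by sequence length.

    Assumes each '>' ID line is followed by exactly one sequence line.
    """
    header = None
    for line in src:
        line = line.strip()
        if not line:               # skip blank lines
            continue
        if line.startswith('>'):   # remember the ID until its sequence arrives
            header = line
            continue
        # Choose the destination based on the inclusive length range.
        target = kept if min_len <= len(line) <= max_len else rejected
        if header is not None:
            target.write(header + '\n')
        target.write(line + '\n')
        header = None

# Small in-memory demonstration with made-up records.
src = io.StringIO(">id_short\nMKV\n>id_ok\n" + "A" * 50 + "\n")
kept, rejected = io.StringIO(), io.StringIO()
split_by_length(src, kept, rejected)
print(kept.getvalue())
print(rejected.getvalue())
```

With real data, the three arguments would be file objects returned by open(), for example one opened for reading and two opened with 'w'.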

FYI, you could also use awk for a one-line solution if you are working in a Unix environment:

grep -v '>' yourinputfile.txt | awk 'length($0)>=40 && length($0)<=4000' > youroutputfile.txt

Upvotes: 0
