Reputation: 141
I have these sequences (over 9000), like this:
>TsM_000224500
MTTKWPQTTVTVATLSWGMLRLSMPKVQTTYKVTQSRGPLLAPGICDSWSRCLVLRVYVDRRRPGGDGSLGRVAVTVVETGCFGSAASFSMWVFGLAFVVTIEEQLL
>TsM_000534500
MHSHIVTVFVALLLTTAVVYAHIGMHGEGCTTLQCQRHAFMMKEREKLNEMQLELMEMLMDIQTMNEQEAYYAGLHGAGMQQPLPMPIQ
>TsM_000355900
MESGEENEYPMSCNIEEEEDIKFEPENGKVAEHESGEKKESIFVKHDDAKWVGIGFAIGTAVAPAVLSGISSAAVQGIRQPIQAGRNNGETTEDLENLINSVEDDL
The lines containing ">" are the IDs and the lines with letters are the amino acid (aa) sequences. I need to delete (or move to another file) the sequences shorter than 40 aa or longer than 4000 aa. The resulting file should then contain only the sequences within this range (>= 40 aa and <= 4000 aa).
I've tried writing the following script:
def read_seq(file_name):
    with open(file_name) as file:
        return file.read().split('\n')[0:]
ts = read_seq("/home/tiago/t_solium/ts_phtm0less.txt")
tsf = open("/home/tiago/t_solium/ts_secp-404k", 'w')
for x in range(len(ts)):
    if ([x][0:1] != '>'):
        if (len([x]) > 40 or len([x]) < 4000):
            tsf.write('%s\n'%(x))
tsf.close()
print "OK!"
I've made some modifications, but all I get is either an empty file or a file with all 9000+ sequences.
Upvotes: 0
Views: 523
Reputation: 294
In your for loop, x is an integer index because you iterate over range() (i.e. 0, 1, 2, 3, 4, ...). Try this instead:
for x in ts:
This will give you each element in ts as x.
Also, you don't need the brackets around x; Python can index and slice strings on its own. When you put brackets around a string, you put it into a list. So if you tried, for example, to get the second character of x with [x][1], Python would look for the second element of the list you put x in and raise an IndexError, because that list has only one element.
EDIT: To include the IDs, try the loop below.
NOTE: I also changed the length test to use and instead of or, which will give you the result you're looking for: len(x) > 40 or len(x) < 4000 is true for every line. The bounds are written as >= 40 and <= 4000 to match the range given in the question.
for i, x in enumerate(ts):  # enumerate ts to get the index of every line (stored as i)
    if not x.startswith('>'):  # skip ID lines (blank lines fail the length test below)
        if 40 <= len(x) <= 4000:
            tsf.write('%s\n' % ts[i-1])  # write the ID found on the preceding line
            tsf.write('%s\n' % x)
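A quick way to see why the or version from the question keeps every sequence is to run both tests on a few made-up lengths. The helper names here are hypothetical and only serve the illustration:

```python
def passes_or(n):
    # The test from the question: true unless n is both <= 40 and >= 4000,
    # which is impossible, so it accepts every length.
    return n > 40 or n < 4000

def passes_and(n):
    # Both conditions must hold, so only lengths strictly between 40 and 4000 pass.
    return n > 40 and n < 4000

lengths = [10, 100, 5000]                         # too short, in range, too long
or_results = [passes_or(n) for n in lengths]      # [True, True, True]
and_results = [passes_and(n) for n in lengths]    # [False, True, False]
```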
Upvotes: 1
Reputation: 1633
Try this; it is simple and easy to understand. It does not load the entire file into memory but iterates over it line by line.
tsf = open('output.txt', 'w')           # open the output file
with open("yourfile", 'r') as ts:       # open the input file
    for line in ts:                     # iterate over each line of the input file
        line = line.strip()             # remove whitespace at both ends, including newlines and carriage returns
        if not line or line[0] == '>':  # if the line is blank or an ID
            continue                    # move to the next line
        if 40 <= len(line) <= 4000:     # if the length is in the required range (both bounds must hold)
            tsf.write('%s\n' % line)    # write the sequence to the output file
tsf.close()                             # done
print("OK!")
FYI, you could also use awk for a one-line solution if you are working in a Unix environment:
cat yourinputfile.txt | grep -v '>' | awk 'length($0)>=40' | awk 'length($0)<=4000' > youroutputfile.txt
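Note that both the script and the one-liner above drop the ID lines. If the IDs should be kept next to their sequences, a streaming variant could look roughly like the sketch below. The function name filter_by_length and its parameters are made up for this example, and it assumes every sequence sits on a single line directly after its ">" line, as in the sample from the question:

```python
def filter_by_length(in_path, out_path, min_len=40, max_len=4000):
    """Copy ID/sequence pairs whose sequence length is within [min_len, max_len].

    Assumes each sequence is on one line right after its '>' ID line.
    Returns the number of sequences written.
    """
    kept = 0
    current_id = None
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue                  # skip blank lines
            if line.startswith('>'):
                current_id = line         # remember the ID for the next line
            elif current_id is not None and min_len <= len(line) <= max_len:
                dst.write('%s\n%s\n' % (current_id, line))
                kept += 1
    return kept
```

It could be called as filter_by_length('yourfile', 'output.txt'), with the file names replaced by your own paths.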
Upvotes: 0