Reputation: 7679
I have the following file:
chr11_pilon3.g3568.t1 transcript:OIS96097 82.2 168 30
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30
gene.100079.0.5.p3 transcript:OIS96097 82.2 169 30
gene.100079.0.3.p1 transcript:OIS96097 82.2 169 30
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 35
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 40
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 40
and I get the following output:
chr11_pilon3.g3568.t1 transcript:OIS96097 82.2 168 30
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 35
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 40
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 40
I just wonder whether there are ways to simplify the code below in terms of how it deals with the end of the file:
try:
    lineParts = line.rstrip().split('\t')
except IndexError:
    continue
and
if dataStorage: # check if Dict is **not** empty
    output(tmp_id, dataStorage, out_fn)
?
from collections import defaultdict

def output(tmp_id, dataDict, out_fn):
    for isoformID in dataDict.keys():
        mx = max(dataDict[isoformID], key=lambda x: int(x[4]))[4]
        mx_values = [d for d in dataDict[isoformID] if d[4] == mx]
        for mx_value in mx_values:
            out_fn.write('\t'.join(mx_value) + '\n')
        del dataDict[isoformID]
    dataDict[tmp_id].append(lineParts)
    return dataDict

dataStorage = defaultdict(list)
with open("data/datap-bigcov.tsv") as data_fn, open("data/datap-bigcov-out.tsv", 'w') as out_fn:
    for line in data_fn:
        try:
            lineParts = line.rstrip().split('\t')
        except IndexError:
            continue
        if lineParts[0].startswith("gene"):
            split_id = lineParts[0].split('.')
            tmp_id = split_id[0] + "." + split_id[1]
        else:
            tmp_id = lineParts[0]
        if not dataStorage: # check if Dict is empty
            dataStorage[tmp_id].append(lineParts)
        elif tmp_id in dataStorage:
            dataStorage[tmp_id].append(lineParts)
        else:
            dataStorage = output(tmp_id, dataStorage, out_fn)

    if dataStorage: # check if Dict is **not** empty
        output(tmp_id, dataStorage, out_fn)
Upvotes: 0
Views: 50
Reputation: 31389
You're already dealing with the end of the file here:
for line in data_fn:
    ...
This automatically stops after the last line is read. However, it appears you want to stop when you encounter the first empty line. A very straightforward way would be:
for line in data_fn:
    line = line.rstrip()
    if not line:
        break
    ...
Or, if you don't like using/abusing the fact that '' evaluates to False:
for line in data_fn:
    if line == '\n':
        break
    ...
Another way to approach it would be to simply skip empty lines (allowing for empty lines in the middle of the file as well):
for line in data_fn:
    line = line.rstrip()
    if line:
        ...
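For what it's worth, combined with the splitting from your own code that pattern could look like this (just a sketch, reusing the filename and tab-separated layout from the question; the try/except is dropped since blank lines are filtered out up front):

with open("data/datap-bigcov.tsv") as data_fn:
    for line in data_fn:
        line = line.rstrip()
        if not line:
            continue  # blank lines anywhere in the file are simply ignored
        lineParts = line.split('\t')
        # ... process lineParts as in your original loop ...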
By the way, you probably don't want to call a variable referencing a file data_fn, since the name suggests this is the name of the file, but it's actually the reference to the file. So data_f, just f, or data would all be more appropriate names.
You asked a follow-up question in the comments: "Thank you, how would you replace this condition to guarantee that the last id get written to file: if blastStorage: output(tmp_id, blastStorage, out_fn)" - I could try to answer that question, but I feel that's going down a rabbit hole with still more questions. There are quite a few programming and style mistakes in your code, and it would go a bit too far to address all of them individually.
Your overall problem is that your code is a bit more convoluted than it needs to be. Often, if you've solved a problem once, it helps to step back and decide if you now understand the problem well enough to write a better (and simpler) solution.
The problem you're solving: you have an input file of records; each record has an associated identifier (the first two parts of the first column split over periods); you want to create an output file with, for each identifier in the input file, all the records where some value is equal to the maximum of that value over all records with that identifier. You appear to want to process the file in a single pass (a simpler, but perhaps slower, solution could do a double pass; a sketch of that is given after the single-pass version below).
Your code seems to suggest you expect records with matching identifiers to be consecutive, so you don't have to keep data around for later matches, and from the way you deal with the maximum value, it appears you want to write out every record that ties for that maximum, not just the first one.
A working (complete, and far simpler) solution, using those implied assumptions, would be:
with open("data/datap-bigcov.tsv") as f_in, open("data/datap-bigcov-out.tsv", 'w') as f_out:
    current = ('', 0)
    max_lines = []
    for line in f_in:
        line_parts = line.rstrip().split()
        rec_id = '.'.join(line_parts[0].split('.')[:2])
        value = int(line_parts[4])
        # id and value match, just add it
        if (rec_id, value) == current:
            max_lines.append(line)
        else:
            # id doesn't match
            if rec_id != current[0]:
                # whatever was collected so far should be written first
                f_out.writelines(max_lines)
            # id matches, but the value is smaller than the current max
            elif value < current[1]:
                # not a new maximum, just skip it
                continue
            # in both other cases, rec_id and value are the values we're after for now
            current = (rec_id, value)
            # start a new set with the greater max
            max_lines = [line]
    f_out.writelines(max_lines)
The statement f_out.writelines(max_lines) is in there twice, to ensure the final id also gets written to file, but I think the overall code is so much simpler that the duplicate call isn't really a concern. Also note that the first time it gets called, it's called with an empty max_lines, but that does what's expected (nothing) and it avoids having to check whether it's empty.
Only 17 lines of Python, without sacrificing clarity or speed.
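For completeness, the double pass mentioned earlier could look roughly like this (just a sketch under the same assumptions about the file layout; it reads the file twice, so it doesn't need matching identifiers to be consecutive, but it will be slower on large files):

from collections import defaultdict

def parse(line):
    # split a data line into its grouping identifier and its numeric value
    parts = line.rstrip().split('\t')
    rec_id = '.'.join(parts[0].split('.')[:2])
    return rec_id, int(parts[4])

# first pass: find the maximum value for each identifier
max_per_id = defaultdict(int)
with open("data/datap-bigcov.tsv") as f_in:
    for line in f_in:
        if line.strip():
            rec_id, value = parse(line)
            max_per_id[rec_id] = max(max_per_id[rec_id], value)

# second pass: write every record whose value ties its identifier's maximum
with open("data/datap-bigcov.tsv") as f_in, open("data/datap-bigcov-out.tsv", 'w') as f_out:
    for line in f_in:
        if line.strip():
            rec_id, value = parse(line)
            if value == max_per_id[rec_id]:
                f_out.write(line)

It trades the bookkeeping of current and max_lines for a second read of the file.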
And then a final note: this type of stuff is generally dealt with very well by the pandas library. Here's a solution to your problem in only a few lines using pandas:
from pandas import read_csv
df = read_csv('data/datap-bigcov.tsv', sep='\t', header=None)
df['id'] = df.apply(lambda rec: '.'.join(rec[0].split('.')[:2]), axis=1)
result = df[df[4] == df.groupby(by='id')[4].transform('max')]
result.to_csv('data/datap-bigcov-out.tsv', sep='\t', header=None)
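The key step is the transform('max'): it computes the maximum of column 4 per id and broadcasts that maximum back onto every row of its group, so comparing it against column 4 gives a boolean mask that keeps exactly the rows tying their group's maximum.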
I feel that using pandas does sacrifice clarity (at least it requires a fairly good understanding of pandas to follow the code), but it performs far better on large datasets and it's pretty much the industry standard, so it's worth learning.
Upvotes: 1