user977828

Reputation: 7679

simplify dealing with end of file

I have the following file:

chr11_pilon3.g3568.t1   transcript:OIS96097 82.2    168 30
chr11_pilon3.g3568.t2   transcript:OIS96097 82.2    169 30
gene.100079.0.5.p3  transcript:OIS96097 82.2    169 30
gene.100079.0.3.p1  transcript:OIS96097 82.2    169 30
gene.100079.0.0.p1  transcript:OIS96097 82.2    169 35
gene.100080.0.3.p1  transcript:OIS96097 82.2    169 40
gene.100080.0.0.p1  transcript:OIS96097 82.2    169 40

and I get the following output:

chr11_pilon3.g3568.t1   transcript:OIS96097 82.2    168 30
chr11_pilon3.g3568.t2   transcript:OIS96097 82.2    169 30
gene.100079.0.0.p1  transcript:OIS96097 82.2    169 35
gene.100080.0.3.p1  transcript:OIS96097 82.2    169 40
gene.100080.0.0.p1  transcript:OIS96097 82.2    169 40

I just wonder whether there are ways to simplify the code below in terms of how it deals with the end of the file

            try:
                lineParts = line.rstrip().split('\t')
            except IndexError:
                continue

and

        if dataStorage:  # check if Dict is **not** empty
            output(tmp_id, dataStorage, out_fn)

?

def output(tmp_id, dataDict, out_fn):
    for isoformID in dataDict.keys():
        mx = max(dataDict[isoformID], key=lambda x: int(x[4]))[4]
        mx_values = [d for d in dataDict[isoformID] if d[4] == mx]
        for mx_value in mx_values:
            out_fn.write('\t'.join(mx_value) + '\n')

    del dataDict[isoformID]
    dataDict[tmp_id].append(lineParts)
    return dataDict

dataStorage = defaultdict(list)

with open("data/datap-bigcov.tsv") as data_fn, open("data/datap-bigcov-out.tsv", 'w') as out_fn:
    for line in data_fn:
        try:
            lineParts = line.rstrip().split('\t')
        except IndexError:
            continue
        if lineParts[0].startswith("gene"):
            split_id = lineParts[0].split('.')
            tmp_id = split_id[0] + "." + split_id[1]
        else:
            tmp_id = lineParts[0]

        if not dataStorage:  # check if Dict is empty
            dataStorage[tmp_id].append(lineParts)
        elif tmp_id in dataStorage:
            dataStorage[tmp_id].append(lineParts)
        else:
            dataStorage = output(tmp_id, dataStorage, out_fn)
    if dataStorage:  # check if Dict is **not** empty
        output(tmp_id, dataStorage, out_fn)

Upvotes: 0

Views: 50

Answers (1)

Grismar

Reputation: 31389

You're already dealing with the end of the file here:

for line in data_fn:
    ...

This automatically stops after the last line is read. However, it appears you want to stop when you encounter the first empty line. A very straightforward way would be:

for line in data_fn:
    line = line.rstrip()
    if not line:
        break
    ...

Or, if you don't like using/abusing the fact that '' evaluates to False:

for line in data_fn:
    if line == '\n':
        break
    ...

Another way to approach it would be to simply skip empty lines (allowing for empty lines in the middle of the file as well):

for line in data_fn:
    line = line.rstrip()
    if line:
        ...

By the way, you probably don't want to call a variable referencing a file data_fn, since the name suggests it holds the name of the file, while it actually holds a reference to the open file object.

So, data_f, just f or data would all be more appropriate names.

You asked a follow-up question in the comments: "Thank you, how would you replace this condition to guarantee that the last id get written to file: if blastStorage: output(tmp_id, blastStorage, out_fn)" - I could try to answer that question, but I feel that's going down a rabbit hole with still more questions. There are quite a few programming and style mistakes in your code, and it goes a bit too far to address all of them individually.

Your overall problem is that your code is a bit more convoluted than it needs to be. Often, if you've solved a problem once, it helps to step back and decide if you now understand the problem well enough to write a better (and simpler) solution.

The problem you're solving: you have an input file of records; each record has an associated identifier (the first two parts of the first column split over periods); you want to create an output file with, for each identifier in the input file, all the records where some value is equal to the maximum same value for all records with that identifier. You appear to want to process the file in a single pass (a simpler, but perhaps slower solution could do a double pass).
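To illustrate the double-pass idea, here is a sketch (the helper names `rec_id`, `value`, `maxes` and `kept` are mine, and I've inlined a few sample rows instead of reading the file, so it runs standalone):

```python
from collections import defaultdict

# sample rows in the input format; a real run would read these from the file
rows = [
    "gene.100079.0.5.p3\ttranscript:OIS96097\t82.2\t169\t30",
    "gene.100079.0.0.p1\ttranscript:OIS96097\t82.2\t169\t35",
    "gene.100080.0.3.p1\ttranscript:OIS96097\t82.2\t169\t40",
    "gene.100080.0.0.p1\ttranscript:OIS96097\t82.2\t169\t40",
]

def rec_id(row):
    # identifier = first two period-separated parts of the first column
    return '.'.join(row.split('\t')[0].split('.')[:2])

def value(row):
    # the value being maximised is the fifth column
    return int(row.split('\t')[4])

# pass 1: find the maximum value per identifier
maxes = defaultdict(int)
for row in rows:
    maxes[rec_id(row)] = max(maxes[rec_id(row)], value(row))

# pass 2: keep only the rows that reach their identifier's maximum
kept = [row for row in rows if value(row) == maxes[rec_id(row)]]
```

This reads the data twice (or keeps it all in memory), but it drops the "consecutive records" requirement entirely.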

Your code seems to suggest you expect records with matching identifiers to be consecutive, so you don't have to keep data around for later matches; the way you deal with the maximum value points to the same assumption.

A working (complete, and far simpler) solution, using those implied assumptions, would be:

with open("data/datap-bigcov.tsv") as f_in, open("data/datap-bigcov-out.tsv", 'w') as f_out:
    current = ('', 0)
    max_lines = []
    for line in f_in:
        line_parts = line.rstrip().split()
        rec_id = '.'.join(line_parts[0].split('.')[:2])
        value = int(line_parts[4])
        # id and value match, just add it
        if (rec_id, value) == current:
            max_lines.append(line)
        else:
            # id doesn't match
            if rec_id != current[0]:
                # whatever was collected so far should be written first
                f_out.writelines(max_lines)
            # value is smaller than the current max for this id, just skip it
            elif value < current[1]:
                continue
            # in both other cases, rec_id and value are the values we're after for now
            current = (rec_id, value)
            # start a new set with the greater max
            max_lines = [line]
    f_out.writelines(max_lines)

The statement f_out.writelines(max_lines) is in there twice, to ensure the final id also gets written to file, but I think the overall code is so much simpler that the duplicate call isn't really a concern.

Also note that the first time it gets called, it's called with an empty max_lines, but that does what's expected (nothing) and it avoids having to check for it to be empty.

Only 17 lines of Python, without sacrificing clarity or speed.
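As an aside, the "consecutive records" assumption maps directly onto itertools.groupby, which groups consecutive items with equal keys. A self-contained sketch of that variant (using StringIO with your sample rows so it runs standalone; the helper names are mine):

```python
from io import StringIO
from itertools import groupby

SAMPLE = '\n'.join([
    "chr11_pilon3.g3568.t1\ttranscript:OIS96097\t82.2\t168\t30",
    "chr11_pilon3.g3568.t2\ttranscript:OIS96097\t82.2\t169\t30",
    "gene.100079.0.5.p3\ttranscript:OIS96097\t82.2\t169\t30",
    "gene.100079.0.3.p1\ttranscript:OIS96097\t82.2\t169\t30",
    "gene.100079.0.0.p1\ttranscript:OIS96097\t82.2\t169\t35",
    "gene.100080.0.3.p1\ttranscript:OIS96097\t82.2\t169\t40",
    "gene.100080.0.0.p1\ttranscript:OIS96097\t82.2\t169\t40",
]) + '\n'

def rec_id(line):
    # identifier = first two period-separated parts of the first column
    return '.'.join(line.split('\t')[0].split('.')[:2])

def value(line):
    # the value being maximised is the fifth column
    return int(line.rstrip().split('\t')[4])

def keep_max_lines(f_in, f_out):
    # groupby relies on records with the same id being consecutive
    for _, group in groupby(f_in, key=rec_id):
        lines = list(group)
        mx = max(value(l) for l in lines)
        f_out.writelines(l for l in lines if value(l) == mx)

out = StringIO()
keep_max_lines(StringIO(SAMPLE), out)
print(out.getvalue(), end='')
```

Each value ends up being parsed more than once, so it's not quite as efficient as the loop above, but the assumption is stated explicitly by the choice of tool.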

And then a final note: this type of stuff is generally dealt with very well by the pandas library. Here's a solution to your problem in only a few lines using pandas:

from pandas import read_csv

df = read_csv('data/datap-bigcov.tsv', sep='\t', header=None)
df['id'] = df.apply(lambda rec: '.'.join(rec[0].split('.')[:2]), axis=1)
result = df[df[4] == df.groupby(by='id')[4].transform('max')]
result.drop(columns='id').to_csv('data/datap-bigcov-out.tsv', sep='\t', header=False, index=False)

I feel that using pandas does sacrifice clarity (it requires a fairly good understanding of pandas to follow the code), but it performs far better on large datasets and it's pretty much the industry standard, so it's worth learning.

Upvotes: 1
