bzmby

Reputation: 69

Filtering using separators in Python

I have many lines like the following:

>ENSG00000003137|ENST00000001146|CYP26B1|72374964|72375167|4732
CGTCGTTAACCGCCGCCATGGCTCCCGCAGAGGCCGAGT
>ENSG00000001630|ENST00000003100|CYP51A1|91763679|91763844|3210
TCCCGGGAGCGCGCTTCTGCGGGATGCTGGGGCGCGAGCGGGACTGTTGACTAAGCTTCG
>ENSG00000003137|ENST00000412253|CYP26B1|72370133;72362405|72370213;72362548|4025
AGCCTTTTTCTTCGACGATTTCCG

In this example, ENSG00000003137 is the name and 4732, the last field, is the length. As you can see, some names are repeated but have different lengths. I want to make a new file that contains only the entry with the longest length for each name, so the result would look like this:

>ENSG00000003137|ENST00000001146|CYP26B1|72374964|72375167|4732
CGTCGTTAACCGCCGCCATGGCTCCCGCAGAGGCCGAGT
>ENSG00000001630|ENST00000003100|CYP51A1|91763679|91763844|3210
TCCCGGGAGCGCGCTTCTGCGGGATGCTGGGGCGCGAGCGGGACTGTTGACTAAGCTTCG

I wrote this code to split the header lines, but I don't know how to produce the file I want:

file = open("file.txt", "r")
for line in file:
    if line.startswith(">"):
        line = line.split("|")

Upvotes: 0

Views: 76

Answers (2)

nibo ai

Reputation: 73

You have to handle two things in the loop: comparing the length, and storing the sequence when it is needed. I wrote an example for you that reads those into dicts:

myDict = {}       # name -> header fields of the longest entry seen so far
myValueDict = {}  # name -> longest length seen so far
geneDict = {}     # name -> sequence belonging to that entry
action = 'remember'

with open("file.txt", "r") as file:
    for line in file:
        if line.startswith(">"):
            line = line.rstrip().split("|")
            line_name = line[0]
            line_number = int(line[-1])
            if line_name in myValueDict and myValueDict[line_name] >= line_number:
                # a longer (or equal) entry is already stored; skip its sequence
                action = 'forget'
            else:
                # first entry for this name, or longer than the stored one
                action = 'remember'
                myValueDict[line_name] = line_number
                myDict[line_name] = line
                geneDict[line_name] = ''
        elif action == 'remember':
            # append, so a sequence wrapped over several lines stays intact
            geneDict[line_name] += line.rstrip()

for key in myDict:
    print(myDict[key])

for key in geneDict:
    print(geneDict[key])

This ignores the shorter entries. You can now store those dicts any way you want.
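For example, here is one way the two dicts could be written back out in the original header-then-sequence layout. This is only a sketch: `format_entries` and `write_entries` are hypothetical helper names, and it assumes the header dict maps each name to the list of `|`-separated fields (as `myDict` does) and the sequence dict maps the same name to its sequence string (as `geneDict` does).

```python
def format_entries(headers, sequences):
    """Return the kept entries as text: a header line, then its sequence."""
    lines = []
    for name, fields in headers.items():
        # the fields were produced by split("|"), so join them back up
        lines.append("|".join(fields))
        # .get() guards against a header that had no sequence line
        lines.append(sequences.get(name, ""))
    return "".join(line + "\n" for line in lines)


def write_entries(headers, sequences, path):
    """Write the formatted entries to the file at `path`."""
    with open(path, "w") as out:
        out.write(format_entries(headers, sequences))
```

With the dicts from the loop above, a call would look like `write_entries(myDict, geneDict, "longest.txt")`, where the output filename is just a placeholder.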

Upvotes: 0

Martijn Pieters

Reputation: 1123740

You'll need to read the file twice; the first time round, track the largest size per entry:

largest = {}
with open(inputfile) as f:
    for line in f:
        if line.startswith('>'):
            parts = line.split('|')
            name, length = parts[0][1:], int(parts[-1])
            largest[name] = max(length, largest.get(name, -1))

then write out the copy in a second pass, but only those sections whose name and length match the extracted largest length from the first pass:

with open(inputfile) as f, open(outputfile, 'w') as out:
    copying = False
    for line in f:
        if line.startswith('>'):
            parts = line.split('|')
            name, length = parts[0][1:], int(parts[-1])
            copying = largest[name] == length
        if copying:
            out.write(line)
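The two passes can be wrapped in a single function, roughly as sketched below. The function name `keep_longest` and the `written` set are additions for illustration, not part of the code above. The set covers one case the two-pass code handles differently: if two entries for the same name share the maximum length, the code above copies both, while this sketch keeps only the first. That is an assumption about what is wanted; drop the `written` check to copy all ties.

```python
def keep_longest(inputfile, outputfile):
    """Copy only the longest entry per name from inputfile to outputfile."""
    # First pass: record the largest length seen for each name.
    largest = {}
    with open(inputfile) as f:
        for line in f:
            if line.startswith('>'):
                parts = line.split('|')
                name, length = parts[0][1:], int(parts[-1])
                largest[name] = max(length, largest.get(name, -1))

    # Second pass: copy a header and the sequence lines after it only
    # when its length equals the recorded maximum for that name.
    written = set()  # names already copied (assumption: keep first tie only)
    with open(inputfile) as f, open(outputfile, 'w') as out:
        copying = False
        for line in f:
            if line.startswith('>'):
                parts = line.split('|')
                name, length = parts[0][1:], int(parts[-1])
                copying = largest[name] == length and name not in written
                if copying:
                    written.add(name)
            if copying:
                out.write(line)
```

A call would look like `keep_longest("file.txt", "longest.txt")`, with both filenames as placeholders. Because the file is read twice rather than held in memory, this should also work for large inputs.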

Upvotes: 1
