user3224522

Reputation: 1151

Filter out lines using regex

My big tab-delimited file, with some text before and after (edited):

chr3Av1G678.1 chr2Bv1G678.9
chr1Av1G978.6 chr1Bv1G456.1
chr2Av1G123.4 chr2Bv1G678.3
chr1Av1G456.0 chr2Av1G784.22

How can I filter the 1A-1B and 2A-2B lines out of the file, so that only 3A 2B and 1A 2A remain?

import re
import sys
f=open('input.txt','r') 
r=open('output.txt','w')
for line in f.readlines():
    line = line.split()
    if not (?) re.search(r'text1Av1', line[0]) and not (?) re.search(r'text1Bv1', line[1]):
        r.write("\t".join(line)+"\n")
f.close()
r.close() 

Upvotes: 1

Views: 1030

Answers (3)

KCoon

Reputation: 134

A simple solution if you want to keep the rest of your text and filter out only those two lines.

Update: now using a regex.

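import re

# Matches a whole line made of a chr1A/chr2A ID followed by a chr1B/chr2B ID.
pattern = re.compile(r'^chr[12]Av1G\d+\.\d+\s*chr[12]Bv1G\d+\.\d+$')

with open('input.txt', 'r') as f, open('output.txt', 'w') as r:
    for line in f:
        # Copy every line that does not match; all other text is kept as is.
        if pattern.search(line) is None:
            r.write(line)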
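
One thing to be aware of: because `[12]` appears in both halves of the pattern above, the two numbers are matched independently, so a mixed pair such as chr1A... chr2B... would be dropped as well. If only same-number pairs should go, a backreference is one way to say so. The following is a minimal sketch, not a drop-in replacement: `same_pair` and `filter_lines` are hypothetical names, and it assumes the IDs always look like the sample rows in the question.

```python
import re

# Assumption: IDs look like chr<number><A|B>v1G<digits>.<digits>, as in the sample.
# \1 must repeat whatever number group 1 captured, so only same-number pairs match.
same_pair = re.compile(r'^chr(\d+)Av1G\d+\.\d+\s+chr\1Bv1G\d+\.\d+$')

def filter_lines(lines):
    """Hypothetical helper: keep the lines that are not same-number A-B pairs."""
    return [line for line in lines if same_pair.search(line) is None]

sample = [
    "chr3Av1G678.1\tchr2Bv1G678.9\n",
    "chr1Av1G978.6\tchr1Bv1G456.1\n",
    "chr2Av1G123.4\tchr2Bv1G678.3\n",
    "chr1Av1G456.0\tchr2Av1G784.22\n",
]
kept = filter_lines(sample)
print("".join(kept), end="")
```

With the sample rows, this keeps the 3A 2B and 1A 2A lines and drops the other two. Lines that do not look like an ID pair at all never match, so they pass through untouched.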

Upvotes: 2

asongtoruin

Reputation: 10359

Assuming you want to keep lines where the first ID has a number followed by A or B, and the second ID (e.g. after a tab) has a different number followed by A or B, the following should work:

import re

# Capture the number in front of the A/B in each of the two IDs.
pair = re.compile(r'chr(\d+)[AB]\S*\s+chr(\d+)[AB]')

with open('input.txt', 'r') as f, open('output.txt', 'w') as o:
    for line in f:
        get_digits = pair.search(line)
        # Write the line only if both IDs were found and their numbers differ.
        if get_digits and get_digits.group(1) != get_digits.group(2):
            o.write(line)

This will write to output.txt only the lines which contain 3A 2B and 1A 2A. Lines that do not match the pattern at all are not written.

To generalise this further, you could change the regex to:

re.compile(r'chr(\d+)[A-Z]\S*\s+chr(\d+)[A-Z]')

which would allow for any capital letter, not just A and B.
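
A detail worth knowing when capturing numbers with a regex group: where the `+` goes matters. With `(\d)+` the group is repeated and only its last repetition is kept, whereas `(\d+)` captures the whole number. For single-digit chromosome numbers the two behave the same, but they diverge as soon as a number has two digits. The sketch below illustrates this on a made-up line; `numbers` is a hypothetical helper and the IDs are invented for the example.

```python
import re

last_digit = re.compile(r'(\d)+[AB]')  # group repeated: only the last digit survives
whole_num = re.compile(r'(\d+)[AB]')   # repetition inside the group: whole number

def numbers(pattern, line):
    """Hypothetical helper: return what the group captured for each ID in the line."""
    return pattern.findall(line)

# Invented example line with two-digit numbers (not from the question's data).
line = "chr12Av1G1.0\tchr22Bv1G2.0"
print(numbers(last_digit, line))  # ['2', '2']   -> looks like the same number
print(numbers(whole_num, line))   # ['12', '22'] -> correctly different
```

Comparing the captures from the first pattern would treat 12A and 22B as the same number, so the second form is the safer choice if the numbers can grow beyond one digit.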

Upvotes: 1

Mohammad Yusuf

Reputation: 17074

You can do it like so:

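import re

with open('input', 'r') as f, open('output', 'w') as f2:
    ftemp = f.read()
    for a in range(1, 4):
        # Collect each distinct "<number><capital letter>" label that follows "chr";
        # the lookbehind keeps the "1G" in "v1G" from being picked up.
        labels = set(re.findall(r'(?<=chr){}[A-Z]'.format(a), ftemp))
        res = '-'.join(sorted(labels))
        print(res)
        f2.write(res + '\n')

Output of print(res):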

1A-1B
2A-2B
3A

Steps:

Create a range() object covering the numbers you want to capture from the file. Then, for each number, search the file for that number followed by a single capital letter.

Upvotes: 1
