Reputation: 1151
I have a large tab-delimited file, with some other text before and after these lines (edited):
chr3Av1G678.1 chr2Bv1G678.9
chr1Av1G978.6 chr1Bv1G456.1
chr2Av1G123.4 chr2Bv1G678.3
chr1Av1G456.0 chr2Av1G784.22
How can I filter out the 1A-1B and 2A-2B lines, so that only the 3A-2B and 1A-2A lines remain?
import re
import sys
f=open('input.txt','r')
r=open('output.txt','w')
for line in f.readlines():
    line = line.split()
    if not (?) re.search(r'text1Av1', line[0]) and not (?) re.search(r'text1Bv1', line[1]):
        r.write("\t".join(line)+"\n")
f.close()
r.close()
Upvotes: 1
Views: 1030
Reputation: 134
A simple solution, if you want to keep the surrounding text and only filter out those two lines.
Update: now using a regex.
import re

with open('input.txt', 'r') as f, open('output.txt', 'w') as r:
    for line in f:
        # Keep every line except those pairing chr1A/chr2A with chr1B/chr2B.
        if re.search(r'^chr[12]Av1G\d+\.\d+\s*chr[12]Bv1G\d+\.\d+$', line) is None:
            r.write(line)
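Note that chr[12]A followed by chr[12]B would also drop a 1A-2B line, if the file has any. A possible tightening, sketched here on the assumption that the A ID is always in the first column and the B ID in the second, is a backreference, so that only pairs with the same chromosome number are removed. The names SAME_NUMBER_AB and keep_line are made up for this example.

```python
import re

# Hypothetical pattern: column 1 is chr<N>A..., column 2 is chr<N>B... with the
# same number N (enforced by the \1 backreference).
SAME_NUMBER_AB = re.compile(r'^chr(\d+)A\S*\s+chr\1B\S*$')

def keep_line(line):
    """Return True for lines that are NOT a same-number A/B pair."""
    return SAME_NUMBER_AB.search(line) is None

sample = [
    "some text before\n",
    "chr3Av1G678.1\tchr2Bv1G678.9\n",
    "chr1Av1G978.6\tchr1Bv1G456.1\n",
    "chr2Av1G123.4\tchr2Bv1G678.3\n",
    "chr1Av1G456.0\tchr2Av1G784.22\n",
]
kept = [line for line in sample if keep_line(line)]
```

Here kept holds the text line plus the 3A-2B and 1A-2A lines; the 1A-1B and 2A-2B lines are dropped.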
Upvotes: 2
Reputation: 10359
Assuming you want to keep lines where you have a number then A or B, then immediately following (e.g. after a tab) a different number followed by A or B, the following should work:
import re

with open('input.txt', 'r') as f:
    read_lines = f.readlines()

with open('output.txt', 'w') as o:
    for line in read_lines:
        get_digits = re.match(r'.*(\d)+[AB]\s+(\d)+[AB].*', line, re.DOTALL)
        # Write the line only if the two captured numbers differ.
        if get_digits and get_digits.group(1) != get_digits.group(2):
            o.write(line)
This will write to output.txt the lines which contain 3A 2B and 1A 2A.
To generalise this further, you could change the regex to:
re.match(r'.*(\d)+[A-Z]\s+(\d)+[A-Z].*', line, re.DOTALL)
This allows any capital letter, not just A and B.
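In the IDs as they appear in the question (chr3Av1G678.1 and so on), the letter is followed by v1G... rather than by whitespace, so the patterns above would probably need adapting. A minimal sketch of the same idea, assuming every ID starts with chr, then a number, then one capital letter; PAIR and different_chromosomes are names invented for this example:

```python
import re

# Assumed ID layout: chr<number><capital letter><rest>, two IDs per line,
# separated by whitespace (e.g. a tab).
PAIR = re.compile(r'chr(\d+)[A-Z]\S*\s+chr(\d+)[A-Z]')

def different_chromosomes(line):
    """True if the line holds two IDs whose chromosome numbers differ."""
    m = PAIR.search(line)
    return m is not None and m.group(1) != m.group(2)

sample = [
    "chr3Av1G678.1\tchr2Bv1G678.9\n",
    "chr1Av1G978.6\tchr1Bv1G456.1\n",
    "chr2Av1G123.4\tchr2Bv1G678.3\n",
    "chr1Av1G456.0\tchr2Av1G784.22\n",
]
kept = [line for line in sample if different_chromosomes(line)]
```

As with the code above, lines that do not match the pattern at all (such as surrounding text) are not written.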
Upvotes: 1
Reputation: 17074
You can do it like so:
import re

with open('input', 'r') as f, open('output', 'w') as f2:
    ftemp = f.read()
    for a in range(1, 4):
        # The lookbehind keeps the "1G" inside "v1G" from being counted as a match.
        res = '-'.join(sorted(set(re.findall(r'(?<=chr){}[A-Z]'.format(a), ftemp))))
        print(res)
        f2.write(res + '\n')
Output of print(res):
1A-1B
2A-2B
3A
Steps:
Create a range() object with the numbers you want to capture from the file. Then search the file for each of those numbers followed by one capital letter.
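If the chromosome numbers are not known in advance, the same summary can be built in one pass instead of hard-coding range(1, 4). This is only a sketch, assuming every ID starts with chr followed by a number and one capital letter; subgenomes_by_chromosome is a made-up name.

```python
import re
from collections import defaultdict

def subgenomes_by_chromosome(text):
    """Map each chromosome number to the sorted capital letters seen with it."""
    found = defaultdict(set)
    # Anchoring on "chr" avoids picking up the "1G" in "v1G".
    for number, letter in re.findall(r'chr(\d+)([A-Z])', text):
        found[int(number)].add(letter)
    return {n: sorted(letters) for n, letters in sorted(found.items())}

text = (
    "chr3Av1G678.1\tchr2Bv1G678.9\n"
    "chr1Av1G978.6\tchr1Bv1G456.1\n"
    "chr2Av1G123.4\tchr2Bv1G678.3\n"
    "chr1Av1G456.0\tchr2Av1G784.22\n"
)
summary = subgenomes_by_chromosome(text)
for number, letters in summary.items():
    print('-'.join('{}{}'.format(number, letter) for letter in letters))
```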
Upvotes: 1