Reputation: 55
I'm trying to use regular expressions in Python to parse a large tab-delimited text file line by line, and print the lines that contain 5 or more instances of 0/1 or 1/1.
My script is almost there, but I am struggling with the 5 or more instances.
This will print the lines with one match.
import re
f = open("infile.txt", "r")
out = open("outfile.txt", "w")
for line in f:
    if re.match(r"(.*)(0|1)/(1)(.*)", line):
        print >> out, line,
To print only lines that have 5 or more matches I tried findall and finditer as follows, but they didn't work:
for line in f:
    x = len(re.findall(r"(.*)(0|1)/(1)(.*)", line)):
    if x > 5:
        print >> out, line,
Can anyone help me with this?
Here is an example of one line from the text file (all spaces are tabs in the file):
X 6529 . C A,G PASS AC=4,2;AF=0.6777 1/1:0,20 0/1:0,16 0/1:0,16 0/0:4,16 0/0:3,1
Upvotes: 1
Views: 170
Reputation: 104702
I think there are two solutions that can work. The first sticks with your current idea of doing a findall with a pattern that matches one occurrence of 0/1 or 1/1. The second is to make a single pattern that will match that text five times at once.
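For the first approach, I think all you need to do is get rid of the .* bits of your current pattern (they appeared spelled out as .astericssymbol in your post; I don't really understand why that's spelled out rather than .*, but it's wrong either way). Because .* is greedy, a single match swallows the whole line, so findall can never return more than one match per line. Here's code that should work: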
for line in f:
    matches = re.findall(r'[01]/1', line)
    if len(matches) >= 5:
        print >> out, line,
I've eliminated the capturing groups, which were not needed and might have made things a bit slower.
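To make the difference concrete, here is a minimal sketch (written for Python 3, unlike the Python 2 snippets above) that counts matches on the sample line from the question with both patterns. The names SAMPLE, count_original and count_hits are made up for this illustration, and the sample line is rebuilt with real tabs on the assumption that every space in the post stands for one tab.

```python
import re

# Sample line from the question, rebuilt with tab separators (assumption:
# each space shown in the post is a single tab in the real file).
SAMPLE = "\t".join([
    "X", "6529", ".", "C", "A,G", "PASS", "AC=4,2;AF=0.6777",
    "1/1:0,20", "0/1:0,16", "0/1:0,16", "0/0:4,16", "0/0:3,1",
])


def count_original(line):
    # Pattern from the question: the greedy (.*) groups let one match
    # cover the whole line, so findall returns a single tuple.
    return len(re.findall(r"(.*)(0|1)/(1)(.*)", line))


def count_hits(line):
    # Pattern from this answer: every 0/1 or 1/1 is a separate match.
    return len(re.findall(r"[01]/1", line))


print(count_original(SAMPLE))  # 1
print(count_hits(SAMPLE))      # 3
```

The sample line has only three hits (1/1, 0/1, 0/1), so it would correctly be left out of the output by the >= 5 test.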
For the second approach, you can make just a single call to re.search, which will return a non-None value only if it finds 5 matches of the appropriate sort. The pattern uses the repetition syntax, {N}, to find exactly N copies of the preceding pattern. In this case we need to match the extra characters in between the 0/1 or 1/1 bits, so the pattern has a .* added. Since we want to repeat the whole thing, we need to wrap it in a non-capturing group, (?:...). Because re.search does not have to consume the whole line, lines with more than five occurrences still match:
for line in f:
    if re.search(r'(?:[01]/1.*){5}', line):
        print >> out, line,
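As a quick check of the single-pattern approach, here is a small Python 3 sketch. FIVE_HITS and has_five_hits are names invented for this example, and the test lines are synthetic fields modelled on the question's sample, not real data.

```python
import re

# Non-capturing group repeated five times; .* soaks up the text between hits.
FIVE_HITS = re.compile(r"(?:[01]/1.*){5}")


def has_five_hits(line):
    # search() may start anywhere, so five OR MORE occurrences will match.
    return FIVE_HITS.search(line) is not None


# Synthetic tab-separated lines with a known number of hits.
four = "\t".join(["1/1:0,20", "0/1:0,16", "0/0:4,16", "0/1:0,16", "1/1:3,1"][:4])
five = "\t".join(["1/1:0,20", "0/1:0,16", "0/1:0,16", "1/1:4,16", "0/1:3,1"])
six = five + "\t1/1:2,9"

print(has_five_hits(four), has_five_hits(five), has_five_hits(six))
```

Note that the first line above contains a 0/0 field, so it only has three hits; the function reports False for it and True for the other two.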
Upvotes: 0
Reputation: 5276
You can use {5,} to match a pattern 5 or more times:
import re
f = open("data.txt", "r")
out = open("dataout.txt", "w")
for line in f:
    # the original pattern had an unbalanced "(" and would not compile
    if re.match(r".*([01]/1.*){5,}", line):
        print >> out, line,
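For anyone on Python 3, where print >> out is no longer valid, the same {5,} idea could be sketched roughly as follows. filter_lines and its minimum parameter are hypothetical names for this illustration; the function takes any file-like objects, so the demo uses in-memory streams instead of data.txt and dataout.txt.

```python
import io
import re


def filter_lines(src, dst, minimum=5):
    """Copy lines with at least `minimum` occurrences of 0/1 or 1/1."""
    # {N,} means "N or more"; the group is non-capturing since nothing is extracted.
    pattern = re.compile(r".*(?:[01]/1.*){%d,}" % minimum)
    kept = 0
    for line in src:
        if pattern.match(line):
            dst.write(line)  # line already ends with its newline
            kept += 1
    return kept


# In-memory demo with synthetic lines (not real data).
five_hits = "\t".join(["1/1:0,20", "0/1:0,16", "0/1:0,16", "1/1:4,16", "0/1:3,1"]) + "\n"
three_hits = "\t".join(["1/1:0,20", "0/1:0,16", "0/1:0,16", "0/0:4,16", "0/0:3,1"]) + "\n"
src = io.StringIO(five_hits + three_hits)
dst = io.StringIO()
kept = filter_lines(src, dst)
```

With real files, the call would look like filter_lines(f, out) inside a with open("data.txt") as f, open("dataout.txt", "w") as out: block, which also closes both files when the loop finishes.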
Upvotes: 1