prepagam

Reputation: 55

How can I print only the lines with five or more matches of a regular expression?

I'm trying to use regular expressions in Python to parse a large tab-delimited text file line by line and print the lines that contain 5 or more instances of 0/1 or 1/1.

My script is almost there, but I am struggling with the 5 or more instances.

This will print the lines with one match.

import re  
f = open ("infile.txt", "r")  
out = open("outfile.txt", "w")  

for line in f:  
    if re.match(r"(.*)(0|1)/(1)(.*)", line):  
        print >> out, line,

To print only lines that have 5 or more matches I tried findall and finditer as follows but they didn't work:

for line in f:  
    x = len(re.findall(r"(.*)(0|1)/(1)(.*)", line))
    if x > 5:  
        print >> out, line,

Can anyone help me with this?

Here is an example of one line from the text file (all spaces are tabs in the file):

X 6529 . C A,G PASS AC=4,2;AF=0.6777 1/1:0,20 0/1:0,16 0/1:0,16 0/0:4,16 0/0:3,1 

Upvotes: 1

Views: 170

Answers (2)

Blckknght

Reputation: 104702

I think there are two solutions that can work. The first sticks with your current idea of doing a findall with a pattern that matches one occurrence of 0/1 or 1/1. The second is to make a single pattern that will match that text five times at once.

For the first approach, I think all you need to do is get rid of the (.*) bits of your current pattern. Because .* is greedy, the pattern swallows the whole line in a single match, so findall never returns more than one result per line. Here's code that should work:

for line in f:
    matches = re.findall(r'[01]/1', line)
    if len(matches) >= 5:
        print >> out, line,

I've eliminated the capturing groups, which were not needed and might have made things a bit slower.
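To see the difference, here is a minimal sketch that runs both patterns over the sample line from the question. It assumes the fields are separated by tabs, as the question says, and uses Python 3 print syntax; the variable names are only illustrative.

```python
import re

# Sample line from the question, with tabs restored between the fields.
line = ("X\t6529\t.\tC\tA,G\tPASS\tAC=4,2;AF=0.6777\t"
        "1/1:0,20\t0/1:0,16\t0/1:0,16\t0/0:4,16\t0/0:3,1\n")

# Original pattern: the greedy .* on both sides consumes the whole line,
# so findall returns a single tuple of groups for the entire line.
greedy = re.findall(r"(.*)(0|1)/(1)(.*)", line)

# Trimmed pattern: every 0/1 or 1/1 is matched on its own.
trimmed = re.findall(r"[01]/1", line)

print(len(greedy))  # 1
print(trimmed)      # ['1/1', '0/1', '0/1']
```

Note that this particular sample line has only three qualifying genotypes, so it would not be printed by the >= 5 test.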

For the second approach, you can make just a single call to re.search, which will return a non-None value only if it finds 5 matches of the appropriate sort. The pattern uses the repetition syntax, {N}, to find exactly N copies of the preceding pattern. In this case we will need to match the extra characters in between the 0/1 or 1/1 bits, so the pattern has a .* added. Since we want to repeat the whole thing, we need to wrap it in a non-capturing group:

for line in f:
    if re.search(r'(?:[01]/1.*){5}', line):
        print >> out, line,
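As a quick check that the two approaches agree, here is a sketch that applies both to two made-up lines. The helper names and the test lines are hypothetical, the patterns are precompiled for convenience, and the lazy .*? is a swapped-in variant of the .* above that should give the same yes/no answer with less backtracking.

```python
import re

# Hypothetical names; patterns precompiled so they are built only once.
FIVE_TIMES = re.compile(r'(?:[01]/1.*?){5}')  # lazy variant of the pattern above
SINGLE = re.compile(r'[01]/1')

def by_search(line):
    """Second approach: one search for five repetitions."""
    return FIVE_TIMES.search(line) is not None

def by_count(line):
    """First approach: count the individual matches."""
    return len(SINGLE.findall(line)) >= 5

# Made-up lines: the first has three qualifying genotypes, the second has five.
few = "X\t1\t1/1:0,20\t0/1:0,16\t0/1:0,16\t0/0:4,16\n"
many = "X\t2\t1/1:0,20\t0/1:0,16\t0/1:0,16\t1/1:4,16\t0/1:3,1\t0/0:1,1\n"

print(by_search(few), by_count(few))    # False False
print(by_search(many), by_count(many))  # True True
```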

Upvotes: 0

gonz

Reputation: 5276

You can use {5,} to match a pattern 5 or more times:

import re
f = open ("data.txt", "r")
out = open("dataout.txt", "w")

for line in f:
    if re.match(r".*([01]/1.*){5,}", line):
        print >> out, line,
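For anyone who wants to try this without real files, here is a self-contained sketch of the same filter. It assumes Python 3, uses io.StringIO objects as stand-ins for the input and output files, and the filter_lines name and the sample rows are made up for illustration.

```python
import io
import re

# Matches from the start of the line only if at least five
# 0/1 or 1/1 genotypes appear somewhere in it.
PATTERN = re.compile(r".*(?:[01]/1.*){5,}")

def filter_lines(src, dst):
    """Copy matching lines from src to dst and return how many were kept."""
    kept = 0
    for line in src:
        if PATTERN.match(line):
            dst.write(line)
            kept += 1
    return kept

# Made-up rows with 3, 5 and 6 qualifying genotypes respectively.
src = io.StringIO(
    "A\t1/1\t0/1\t0/1\t0/0\t0/0\n"
    "B\t1/1\t0/1\t0/1\t1/1\t0/1\n"
    "C\t0/1\t0/1\t0/1\t0/1\t0/1\t1/1\n"
)
dst = io.StringIO()
kept = filter_lines(src, dst)
print(kept)  # 2
```

With real files, the two StringIO objects would be replaced by the handles returned from open().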

Upvotes: 1
