Ishu Gupta

Reputation: 1101

Find the first occurrence of a pattern match in a file and then stop traversing the data

I have a file containing data like this:

D11/22/1984 D 123 q423 ooo 11/22/1987
R11/22/1985 123 q423 ooo 11/22/1987
D12/24/1986 123 q423 ooo 11/22/1987
511/27/1987 123 q423 ooo 11/22/1987
D18/29/1988 123 q423 ooo 11/22/1987

I need to pick the first record matching the pattern ^D(\d{2}/\d{2}/\d{4}) and then stop traversing the rest of the file.

For example, in the data above I want only the value D11/22/1984, not D12/24/1986 or D18/29/1988.

I want to write it in Python using the re module.

Upvotes: 1

Views: 6564

Answers (4)

chapelo

Reputation: 2562

This regex captures only the first occurrence, because with re.S the trailing .* consumes the rest of the data:

import re

filedata = '''
D11/22/1984 D 123 q423 ooo 11/22/1987
R11/22/1985 123 q423 ooo 11/22/1987
D12/24/1986 123 q423 ooo 11/22/1987
511/27/1987 123 q423 ooo 11/22/1987
D18/29/1988 123 q423 ooo 11/22/1987 
'''

print(list(re.findall(r'^D(\d{2}/\d{2}/\d{4})?.*', filedata, flags=re.M|re.S)))
# ['11/22/1984']

Alternatively, re.search scans the string, returns the first match it finds, and stops scanning there (this may be what you are looking for):

print(re.search(r'^D(\d{2}/\d{2}/\d{4})', filedata, flags=re.M|re.S).groups())
# ('11/22/1984',)
# No need for the (...)?.* here; your original pattern can be used.
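
One caveat with calling .groups() directly on the result: re.search returns None when nothing matches, so the call would raise AttributeError on a file without a D record. A minimal sketch of a guarded version follows; the helper name first_date_in and the sample strings are illustrative assumptions, not part of the original answer.

```python
import re

def first_date_in(text):
    """Return the first date following a line-initial 'D', or None."""
    # re.M lets ^ match at the start of every line, not just the string.
    m = re.search(r'^D(\d{2}/\d{2}/\d{4})', text, flags=re.M)
    return m.group(1) if m else None

# First D record is on the second line.
a = first_date_in("R11/22/1985 123\nD12/24/1986 123\nD18/29/1988 123\n")
# No D record at all: the helper returns None instead of raising.
b = first_date_in("R11/22/1985 123\n")
print(a, b)
```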

With this regex, findall finds... all occurrences:

print(list(re.findall(r'^D(\d{2}/\d{2}/\d{4})', filedata, flags=re.M|re.S)))
# ['11/22/1984', '12/24/1986', '18/29/1988']

Upvotes: 1

Jon Clements

Reputation: 142156

You can build a generator over your file object (the following assumes it's called f) that applies re.match to each line, then take the first match, e.g.:

import re

matches = (re.match(r'D(\d{2}/\d{2}/\d{4})', line) for line in f)
first_match = next((match.group(1) for match in matches if match), None)

If you get None, then no matches were found. You can also extend this to take the first n matches:

from itertools import islice
# Python 3's built-in filter is lazy; on Python 2 use itertools.ifilter instead.
first5 = list(islice(filter(None, matches), 5))  # a list of match objects

If you then get an empty list, no matches were found.
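
As a minimal, self-contained sketch of this generator approach on Python 3, the snippet below wraps the sample data from the question in io.StringIO to stand in for the open file f. The names SAMPLE, PATTERN, and first_date are illustrative assumptions, not part of the answer.

```python
import io
import re

# Sample records from the question; io.StringIO stands in for an open file.
SAMPLE = """\
D11/22/1984 D 123 q423 ooo 11/22/1987
R11/22/1985 123 q423 ooo 11/22/1987
D12/24/1986 123 q423 ooo 11/22/1987
511/27/1987 123 q423 ooo 11/22/1987
D18/29/1988 123 q423 ooo 11/22/1987
"""

PATTERN = re.compile(r'D(\d{2}/\d{2}/\d{4})')

def first_date(f):
    """Return the first captured date, or None if no line matches."""
    # The generator is lazy: lines are read only until next() finds a match.
    matches = (PATTERN.match(line) for line in f)
    return next((m.group(1) for m in matches if m), None)

f = io.StringIO(SAMPLE)
result = first_date(f)
remaining = f.readline()  # the next unread line shows where iteration stopped
print(result)
print(remaining, end='')
```

Because the generator is lazy, only the first line is consumed here, and the line read afterwards is the second record.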

Upvotes: 3

user2555451

Reputation:

You can use a function that iterates over the file object with a for-loop and then returns when it finds the first match:

import re
def func():
    with open('/path/to/file.txt') as f: # Open the file (auto-close it too)
        for line in f: # Go through the lines one at a time
            m = re.match(r'D(\d{2}/\d{2}/\d{4})', line) # Check each line
            if m: # If we have a match...
                return m.group(1) # ...return the value

Iterating over a file object yields its lines one-by-one. So, we only check as many lines as necessary.

Also, I removed the ^ from your pattern since re.match already matches from the start of the string by default.


If you already have a file object open, just remove the with-statement and pass the file as an argument to the function:

import re
def func(f):
    for line in f: # Go through the lines one at a time
        m = re.match(r'D(\d{2}/\d{2}/\d{4})', line) # Check each line
        if m: # If we have a match...
            return m.group(1) # ...return the value

Just remember to close the file when you are done with it.
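
One way to make sure the file gets closed is to keep the with-statement in the caller and pass the open file object in. The sketch below assumes a throwaway temporary file with made-up contents so that it can run on its own; the path and data are hypothetical.

```python
import os
import re
import tempfile

def func(f):
    for line in f:  # Go through the lines one at a time
        m = re.match(r'D(\d{2}/\d{2}/\d{4})', line)  # Check each line
        if m:  # If we have a match...
            return m.group(1)  # ...return the value
    return None  # No line matched

# Create a small sample file (hypothetical data, for illustration only).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("R11/22/1985 123\nD12/24/1986 123\nD18/29/1988 123\n")
    path = tmp.name

try:
    # The caller owns the file: the with-block closes it even if func returns early.
    with open(path) as f:
        first = func(f)
finally:
    os.remove(path)  # Clean up the sample file

print(first)
```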

Upvotes: 1

alpha bravo

Reputation: 7948

You could consume the rest of your data after the first match, like so:

^D(\d{2}/\d{2}/\d{4})[\s\S]+
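
A minimal sketch of using this pattern from Python, assuming the whole file has already been read into a string (the variable names data, pattern, and dates are illustrative). The re.M flag is needed so that ^ can match at the start of any line.

```python
import re

# Sample records from the question, as one string.
data = (
    "D11/22/1984 D 123 q423 ooo 11/22/1987\n"
    "R11/22/1985 123 q423 ooo 11/22/1987\n"
    "D12/24/1986 123 q423 ooo 11/22/1987\n"
    "511/27/1987 123 q423 ooo 11/22/1987\n"
    "D18/29/1988 123 q423 ooo 11/22/1987\n"
)

# [\s\S]+ matches any character including newlines, so the first match
# swallows everything after the first date and leaves nothing for a second one.
pattern = re.compile(r'^D(\d{2}/\d{2}/\d{4})[\s\S]+', re.M)
dates = pattern.findall(data)
print(dates)
```

Note that [\s\S]+ still walks through the remaining text, so this limits the result to one capture rather than stopping the scan early.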


Upvotes: 1
