Reputation: 55
Example string 1:
7.2.P.8.1
Summary and Conclusion
A stability study with two batches was carried out.
Example string 2:
7.2.S.1.2
Structure
Not applicable as the substance is not present.
I want to write a regex that finds a section number of the form 7.2.P.8.1, 7.2.S.1.2, 8-3-1-P-2, or any similar format (the parts are always separated by either . or -) and retrieves the first line after it. So from the first instance I need the output Summary and Conclusion, and from the second instance Structure. The words 'Example string' won't be part of the file content; they are only there to show the examples.
Occasionally the format may be:
9.2.P.8.1 Summary and Conclusion
A stability study with two batches was carried out.
In this case too, I want to retrieve Summary and Conclusion as the output.
Note: I only want to retrieve the first match in the file, not all matches, so my code should stop as soon as it finds the first one. How can I do this efficiently?
Code so far:
import re
def func():
    with open('/path/to/file.txt') as f:  # Open the file (auto-close it too)
        for line in f:  # Go through the lines one at a time
            m = re.match('\d+(?:[.-]\w+)*\s*', line)  # Check each line
            if m:  # If we have a match...
                return m.group(1)  # ...return the value
Upvotes: 1
Views: 1040
Reputation: 627607
You may use
import re
rx = re.compile(r'\d+(?:[.-]\w+)*\s*(\S.*)?$')
found = False
with open('/path/to/file.txt', 'r') as f:
    for line in f:
        if not found:                      # If the required line is not found yet
            m = rx.match(line.strip())     # Check if matching line found
            if m:
                if m.group(1):             # If Group 1 is not empty
                    print(m.group(1))      # Print it
                    break                  # Stop processing
                else:                      # Else, the next non-blank line is necessary
                    found = True           # Set found flag to True
        else:
            if not line.strip():           # Skip blank line
                pass
            else:
                print(line.strip())        # Else, print the match
                break                      # Stop processing
See the Python demo and the regex demo.
NOTES
The \d+(?:[.-]\w+)*\s*(\S.*)?$ regex matches 1+ digits, then 0 or more repetitions of . or - followed with 1+ word chars, then 0+ whitespaces, and then optionally captures into Group 1 a non-whitespace char followed with any 0+ chars up to the line end. Since re.match only matches at the start of the string, the pattern is applied from the beginning of each stripped line. If Group 1 is not empty, the match is found, it is printed, and break stops processing. Else, the found boolean flag is set to True and the next non-blank line is printed.
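If you would rather get the value back from a function than print it, the same logic can be sketched as below. This is only a minimal sketch: the function name first_heading and the in-memory sample lists are assumptions made for illustration, and it takes any iterable of lines (an open file object works too), so it still stops reading as soon as the first match is found.

```python
import re

# Same pattern as above: section number, optional whitespace, optional title.
RX = re.compile(r'\d+(?:[.-]\w+)*\s*(\S.*)?$')

def first_heading(lines):
    """Return the title that follows the first section number, or None.

    `lines` is any iterable of strings, e.g. an open file object.
    (Hypothetical helper name, not part of any library.)
    """
    found = False                      # True once a bare section number was seen
    for line in lines:
        text = line.strip()
        if not found:
            m = RX.match(text)
            if m:
                if m.group(1):         # Title is on the same line as the number
                    return m.group(1)
                found = True           # Title is expected on a following line
        elif text:                     # First non-blank line after the number
            return text
    return None                        # No section number (or no title) found

# Sample data taken from the question, used here instead of a real file.
sample_1 = ["7.2.P.8.1\n", "Summary and Conclusion\n",
            "A stability study with two batches was carried out.\n"]
sample_2 = ["9.2.P.8.1 Summary and Conclusion\n",
            "A stability study with two batches was carried out.\n"]

print(first_heading(sample_1))  # Summary and Conclusion
print(first_heading(sample_2))  # Summary and Conclusion
```

With a real file, you would call it as first_heading(f) inside the with open(...) block. Note that the pattern accepts any line that starts with a digit, so a body line such as "2 batches were used" would also count as a match; tighten the regex if that can occur before the first section number in your files.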
Upvotes: 2