Reputation: 198
I'm quite new to regex patterns. I'm having difficulty parsing a text file and returning one match per paragraph. Paragraphs are separated by a blank line, and every paragraph is unique.
Here is my example text file
A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141
What I want is for matches[0] to be: #A quick brown fox jumps over the lazy dog; 1234;
and matches[1] to be: #Here is the second paragraph 123141
I've tried
import re
regex = re.compile(r"(.*\n)\n", re.MULTILINE)
with open(file_dir, "r") as file:
    matches = regex.findall(file.read())
print(matches)
But the result is ['1234;\n']. It doesn't capture the whole first paragraph, and it doesn't capture the second one at all. What is the most efficient way of doing this?
Upvotes: 1
Views: 2397
Reputation: 44043
Try (\S[\s\S]*?)(?:\n\n|$):
1. \S matches a non-whitespace character.
2. [\s\S]*? matches 0 or more whitespace or non-whitespace characters, i.e. any character including a newline, non-greedily. Items 1 and 2 are in capture group 1.
3. (?:\n\n|$) matches two successive newline characters or $ (which matches at the end of the string or just before a newline at the end of the string) in a non-capture group.
The code:
import re
s = """A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141"""
matches = re.findall(r'(\S[\s\S]*?)(?:\n\n|$)', s)
print(matches)
Prints:
['A quick brown\nfox jumps over\nthe lazy dog;\n1234;', 'Here is\nthe second paragraph\n123141']
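The question asks for each paragraph on a single line, whereas the matches above still contain the internal newlines. A minimal sketch of one way to flatten them, assuming any run of whitespace inside a paragraph may be collapsed to a single space (the names text, paragraphs and flattened are only illustrative):

```python
import re

text = ("A quick brown\nfox jumps over\nthe lazy dog;\n1234;\n\n"
        "Here is\nthe second paragraph\n123141")

# Same pattern as above: capture from a non-whitespace character
# up to a blank line or the end of the string.
paragraphs = re.findall(r'(\S[\s\S]*?)(?:\n\n|$)', text)

# str.split() with no arguments splits on any run of whitespace,
# so joining the pieces with one space puts each paragraph on one line.
flattened = [" ".join(p.split()) for p in paragraphs]
print(flattened)
```

This should give 'A quick brown fox jumps over the lazy dog; 1234;' as the first element and 'Here is the second paragraph 123141' as the second.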
Alternatively, you can use:
\S(?:(?!\n\n)[\s\S])*
This uses a negative lookahead assertion and has about the same cost as the previous regex. It first matches a non-whitespace character and then keeps consuming one character at a time for as long as the current position is not the start of two successive newline characters.
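A short sketch of the lookahead variant on the same sample text (the variable names are illustrative). Since this pattern has no capture group, findall returns the whole match for each paragraph:

```python
import re

text = ("A quick brown\nfox jumps over\nthe lazy dog;\n1234;\n\n"
        "Here is\nthe second paragraph\n123141")

# \S starts the paragraph; (?!\n\n) refuses to consume a character
# whenever a blank line begins at the current position.
lookahead = re.compile(r'\S(?:(?!\n\n)[\s\S])*')

paragraphs = lookahead.findall(text)
print(paragraphs)
```

One difference to be aware of: if the text ends with a single trailing newline, this pattern keeps that newline in the last match, so you may want to call .strip() on each result.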
Upvotes: 2
Reputation: 1500
This is a good start:
(?:.+\s)+
Test code:
import re
regex = r"(?:.+\s)+"
test_str = ("A quick brown\n"
            "fox jumps over\n"
            "the lazy dog;\n"
            "1234;\n\n"
            "Here is\n"
            "the second paragraph\n"
            "123141")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
Output:
Match 1 was found at 0-49: A quick brown
fox jumps over
the lazy dog;
1234;
Match 2 was found at 50-79: Here is
the second paragraph
You can see that the last line of the last paragraph is truncated. To avoid this, append a \n to the end of the string before matching, so the regex can detect the end of the paragraph:
test_str += '\n'
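Putting the two steps together, here is a minimal sketch (the name paragraphs is illustrative) that appends the newline only when it is missing. Because every match of this pattern ends with the whitespace character that closes its last line, the sketch also strips trailing newlines from each match:

```python
import re

test_str = ("A quick brown\nfox jumps over\nthe lazy dog;\n1234;\n\n"
            "Here is\nthe second paragraph\n123141")

# Make sure the last line is terminated, so (?:.+\s)+ can consume it.
if not test_str.endswith("\n"):
    test_str += "\n"

# Each match ends with the newline that closes its last line; drop it.
paragraphs = [m.group().rstrip("\n")
              for m in re.finditer(r"(?:.+\s)+", test_str)]
print(paragraphs)
```

With the sample text this should yield two paragraphs, the second one now ending in 123141.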
Upvotes: -1