Octane

Reputation: 198

Newbie in regex patterns. How to capture multiple lines?

I'm quite new to regex patterns. I'm having difficulty parsing a text file and returning one match per paragraph. Paragraphs are separated by a blank line, and every paragraph is unique.

Here is my example text file

A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141

I want matches[0] to be: #A quick brown fox jumps over the lazy dog; 1234;

matches[1] to be: #Here is the second paragraph 123141

I've tried

import re

regex = re.compile(r"(.*\n)\n", re.MULTILINE)
with open(file_dir, "r") as file:
    matches = regex.findall(file.read())
print(matches)

But the result is ['1234;\n']. It doesn't capture the whole first paragraph, and it doesn't capture the second one at all. What is the most efficient way of doing this?

Upvotes: 1

Views: 2397

Answers (2)

Booboo

Reputation: 44043

Try (\S[\s\S]*?)(?:\n\n|$):

  1. \S Matches a non-whitespace character
  2. [\s\S]*? Matches zero or more characters of any kind, including newlines, non-greedily (as few as possible). Items 1 and 2 are in capture group 1.
  3. (?:\n\n|$) Matches, in a non-capturing group, either two successive newline characters or $ (which matches at the end of the string, or just before a newline that ends the string).


The code:

import re

s = """A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141"""

matches = re.findall(r'(\S[\s\S]*?)(?:\n\n|$)', s)
print(matches)

Prints:

['A quick brown\nfox jumps over\nthe lazy dog;\n1234;', 'Here is\nthe second paragraph\n123141']
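The matches above still contain the original line breaks, while the question asks for each paragraph on a single line. A minimal sketch of one way to get there, assuming that collapsing every run of whitespace into one space is acceptable (the names text, paragraphs and flattened are just for illustration):

```python
import re

text = """A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141"""

# Same pattern as above; each match is one paragraph with embedded newlines.
paragraphs = re.findall(r'(\S[\s\S]*?)(?:\n\n|$)', text)

# str.split() with no arguments splits on any whitespace run (newlines too),
# so joining with a single space puts each paragraph on one line.
flattened = [" ".join(p.split()) for p in paragraphs]
print(flattened)
# ['A quick brown fox jumps over the lazy dog; 1234;', 'Here is the second paragraph 123141']
```

Note that this also collapses repeated spaces inside a line. If those must be preserved, p.replace("\n", " ") would be the narrower choice.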

Alternatively, you can use:

\S(?:(?!\n\n)[\s\S])*

This uses a negative lookahead assertion and has about the same cost as the previous regex. It first matches a non-whitespace character and then keeps consuming one character at a time for as long as the text at the current position does not start with two successive newline characters.
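For completeness, here is a short sketch of the lookahead variant run against the same sample text (the variable names are mine). Since the pattern has no capture group, re.findall returns the whole match for each paragraph:

```python
import re

text = (
    "A quick brown\n"
    "fox jumps over\n"
    "the lazy dog;\n"
    "1234;\n"
    "\n"
    "Here is\n"
    "the second paragraph\n"
    "123141"
)

# Start at a non-whitespace character, then take one character at a time
# as long as the current position is not the start of a blank line ("\n\n").
pattern = re.compile(r'\S(?:(?!\n\n)[\s\S])*')
matches = pattern.findall(text)
print(matches)
# ['A quick brown\nfox jumps over\nthe lazy dog;\n1234;', 'Here is\nthe second paragraph\n123141']
```

One difference to be aware of: if the input ends with a single trailing newline, this version includes that newline in the last match, so you may want to strip each result.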


Upvotes: 2

totok

Reputation: 1500

This is a good start:

(?:.+\s)+


Test code:

import re

regex = r"(?:.+\s)+"

test_str = ("A quick brown\n"
    "fox jumps over\n"
    "the lazy dog;\n"
    "1234;\n\n"
    "Here is\n"
    "the second paragraph\n"
    "123141")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # This pattern has no capture groups, so the loop below prints nothing here.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

Output:

Match 1 was found at 0-49: A quick brown
fox jumps over
the lazy dog;
1234;

Match 2 was found at 50-79: Here is
the second paragraph

You can see that the last line of the last paragraph (123141) is missing. The pattern requires a whitespace character after every line, and the string does not end with one. To avoid this, append a \n to the string before matching, so the regex can detect the end of the last paragraph: test_str += '\n'
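A condensed sketch of that fix, assuming you only need the matched text and not the positions (the list name paragraphs is mine). Each match keeps its trailing newline, so it is stripped afterwards:

```python
import re

test_str = ("A quick brown\n"
    "fox jumps over\n"
    "the lazy dog;\n"
    "1234;\n\n"
    "Here is\n"
    "the second paragraph\n"
    "123141")

# Make sure the last line is followed by whitespace so the pattern can match it.
test_str += "\n"

# Every line of a paragraph matches ".+\s"; a blank line stops the repetition.
paragraphs = [m.group().rstrip("\n") for m in re.finditer(r"(?:.+\s)+", test_str)]
print(paragraphs)
# ['A quick brown\nfox jumps over\nthe lazy dog;\n1234;', 'Here is\nthe second paragraph\n123141']
```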


Upvotes: -1
