codingIsInteresting

Reputation: 57

How to extract the list of text between the pattern using RegEx?

I have text like:

05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC

COM
Payable: 05/06/2021
QUALIFIED DIVIDENDS 23.50 

ATVI - 0.00 23.50 (9,425.77)

05/13/21 05/13/21 Margin Div/Int - Income APPLE INC
COM
Payable: 05/13/2021
QUALIFIED DIVIDENDS 6.16 

AAPL - 0.00 6.16 (9,419.61)

05/28/21 05/28/21 Margin Div/Int - Income STARBUCKS CORP
COM
Payable: 05/28/2021
QUALIFIED DIVIDENDS 18.00 

SBUX - 0.00 18.00 (9,401.61)

05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE
Payable: 05/28/2021 

 - - 0.00 (73.03) (9,474.64)

I want to extract individual records, such as:

05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC

COM
Payable: 05/06/2021
QUALIFIED DIVIDENDS 23.50 

ATVI - 0.00 23.50 (9,425.77)

and

05/13/21 05/13/21 Margin Div/Int - Income APPLE INC
COM
Payable: 05/13/2021
QUALIFIED DIVIDENDS 6.16 

AAPL - 0.00 6.16 (9,419.61)

and

05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE
Payable: 05/28/2021 

 - - 0.00 (73.03) (9,474.64)

Each record should start with a date (\d+/\d+/\d) and end just before the next record, i.e. at (\n\n\d+/\d+/\d).

I have tried re.findall(r'\d+/\d+/\d(.*?)\n\n\d+/\d+/\d+', a), but it doesn't work as expected.

Upvotes: 0

Views: 72

Answers (3)

The fourth bird

Reputation: 163207

You can match a date-like pattern at the start of a line, and then keep matching all following lines that do not start with a date-like pattern.

^\d+/\d+/\d+ .*(?:\n(?!^\d+/\d+/\d+ ).*)*

The pattern matches:

  • ^ Start of a line (with the re.M flag; without it, only the start of the string)
  • \d+/\d+/\d+  Match a date-like pattern followed by a space
  • .* Match the rest of the line
  • (?: Non-capturing group
    • \n(?!^\d+/\d+/\d+ ).* Match a newline and the rest of the line if it does not start with a date-like pattern
  • )* Close the non-capturing group and optionally repeat it

See a regex demo and a Python demo.

You can use re.findall to get all the matches:

import re

pattern = r"^\d+/\d+/\d+ .*(?:\n(?!^\d+/\d+/\d+ ).*)*"
 
s = ("05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC\n\n....")
 
print(re.findall(pattern, s, re.M))
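
For a version that runs end to end, here is a minimal sketch that applies the same pattern to an abridged copy of the sample from the question. The variable names (text, records) and the final strip() call are assumptions added for illustration: each match can end with the newline of the blank line that precedes the next record, so the sketch trims it.

```python
import re

# Abridged copy of the sample text from the question (three records).
text = """05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC

COM
Payable: 05/06/2021
QUALIFIED DIVIDENDS 23.50

ATVI - 0.00 23.50 (9,425.77)

05/13/21 05/13/21 Margin Div/Int - Income APPLE INC
COM
Payable: 05/13/2021
QUALIFIED DIVIDENDS 6.16

AAPL - 0.00 6.16 (9,419.61)

05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE
Payable: 05/28/2021

 - - 0.00 (73.03) (9,474.64)
"""

# A line starting with a date and a space, followed by every line
# that does not itself start with a date and a space.
pattern = r"^\d+/\d+/\d+ .*(?:\n(?!^\d+/\d+/\d+ ).*)*"

# re.M makes ^ match at the start of every line, not only the string.
# strip() drops the trailing blank line that the match may include.
records = [m.strip() for m in re.findall(pattern, text, re.M)]

for record in records:
    print(record)
    print("-----")
```

Note that a "Payable: 05/06/2021" line does not start a new record, because the date there is not at the beginning of the line.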

Upvotes: 1

Cary Swoveland

Reputation: 110665

You can match

.+?(?=\s*(?:\d{2}\/\d{2}\/\d{2} ){2}|$)

with 'g' ("global") and 's' ("single line" or "dot-all") flags set. 's' causes periods to match all characters, including line terminators.

Demo

The regular expression can be broken down as follows.

.+?                        # match one or more chars, lazily
(?=                        # begin a positive lookahead
  \s*                      # match zero or more whitespaces
  (?:                      # begin a non-capture group 
    \d{2}\/\d{2}\/\d{2}[ ] # match a date string followed by a space
  ){2}                     # end the non-capture group and execute it twice
|                          # or
  $                        # match the end of the string
)                          # end positive lookahead
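
In Python, the 'g' flag roughly corresponds to calling re.findall, and 's' corresponds to re.S (re.DOTALL). The following is a sketch under those assumptions, run on an abridged copy of the sample; the names text and records are illustrative. Because the lookahead does not consume the whitespace between records, findall also appears to return whitespace-only matches there, so the sketch filters them out.

```python
import re

# Abridged copy of the sample text from the question (three records).
text = """05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC

COM
Payable: 05/06/2021
QUALIFIED DIVIDENDS 23.50

ATVI - 0.00 23.50 (9,425.77)

05/13/21 05/13/21 Margin Div/Int - Income APPLE INC
COM
Payable: 05/13/2021
QUALIFIED DIVIDENDS 6.16

AAPL - 0.00 6.16 (9,419.61)

05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE
Payable: 05/28/2021

 - - 0.00 (73.03) (9,474.64)
"""

# Lazily match up to (but not including) optional whitespace followed by
# two consecutive dates, or up to the end of the string.
pattern = r".+?(?=\s*(?:\d{2}/\d{2}/\d{2} ){2}|$)"

# re.S lets "." match newlines, so one match can span several lines.
raw_matches = re.findall(pattern, text, re.S)

# Keep only matches that contain something other than whitespace.
records = [m.strip() for m in raw_matches if m.strip()]

for record in records:
    print(record)
    print("-----")
```

This variant relies on every record starting with two dates in a row, which holds for the sample but is an assumption about the rest of the data.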

Upvotes: 1

Sree Kumar

Reputation: 2245

You can use this as base and make changes to get to the exact one you need:

\d+\/\d+\/\d+(.*?)\\n\\n(\s+\d+\/\d+\/\d+|$)

You can try it in the demo.

The changes I have made are these:

  • \n becomes \\n.
  • There is a space between \n\n and the dates in the sample text. I have added that in the regex.
  • The year part of the date in the regex was missing +. I have added that.
  • The last part in the sample doesn't contain a date at the end. That check has been included.
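
One possible way to adapt this base pattern for re.findall in Python is sketched below; it is not the answer's own code. It assumes the text contains real newline characters (so \n is not doubled), and it swaps the trailing group for a lookahead, since a group that consumes the next record's date would prevent the following match from starting at that date. The names text and records are illustrative.

```python
import re

# Abridged copy of the sample text from the question (three records).
text = """05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC

COM
Payable: 05/06/2021
QUALIFIED DIVIDENDS 23.50

ATVI - 0.00 23.50 (9,425.77)

05/13/21 05/13/21 Margin Div/Int - Income APPLE INC
COM
Payable: 05/13/2021
QUALIFIED DIVIDENDS 6.16

AAPL - 0.00 6.16 (9,419.61)

05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE
Payable: 05/28/2021

 - - 0.00 (73.03) (9,474.64)
"""

# Start at a date, then lazily match until either a blank line followed
# by another date, or the end of the text (allowing trailing whitespace).
# The terminator is a lookahead, so the next date is left for the next match.
pattern = r"\d+/\d+/\d+.*?(?=\n\n\s*\d+/\d+/\d+|\s*\Z)"

# re.S lets "." match newlines so a record can span several lines.
records = re.findall(pattern, text, re.S)

for record in records:
    print(record)
    print("-----")
```

The sketch assumes that a blank line followed by a date-like token occurs only at record boundaries, which is true of the sample text.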

Upvotes: 1
