Reputation: 57
I have text like:
05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC
COM
Payable: 05/06/2021
QUALIFIED DIVIDENDS 23.50
ATVI - 0.00 23.50 (9,425.77)
05/13/21 05/13/21 Margin Div/Int - Income APPLE INC
COM
Payable: 05/13/2021
QUALIFIED DIVIDENDS 6.16
AAPL - 0.00 6.16 (9,419.61)
05/28/21 05/28/21 Margin Div/Int - Income STARBUCKS CORP
COM
Payable: 05/28/2021
QUALIFIED DIVIDENDS 18.00
SBUX - 0.00 18.00 (9,401.61)
05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE
Payable: 05/28/2021
- - 0.00 (73.03) (9,474.64)
I want to extract individual records, such as:
05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC
COM
Payable: 05/06/2021
QUALIFIED DIVIDENDS 23.50
ATVI - 0.00 23.50 (9,425.77)
and
05/13/21 05/13/21 Margin Div/Int - Income APPLE INC
COM
Payable: 05/13/2021
QUALIFIED DIVIDENDS 6.16
AAPL - 0.00 6.16 (9,419.61)
and
05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE
Payable: 05/28/2021
- - 0.00 (73.03) (9,474.64)
Each record should start with a date (\d+/\d+/\d) and end just before the next record's date (\n\n\d+/\d+/\d).
I have tried re.findall(r'\d+/\d+/\d(.*?)\n\n\d+/\d+/\d+', a), but it doesn't work as expected.
Upvotes: 0
Views: 72
Reputation: 163207
You can match a date-like pattern at the start of a line, and then repeat all following lines that do not start with a date-like pattern.
^\d+/\d+/\d+ .*(?:\n(?!^\d+/\d+/\d+ ).*)*
The pattern matches:
^    Start of string (start of each line when re.M is set)
\d+/\d+/\d+     Match a date-like pattern and a space
.*    Match the rest of the line
(?:    Non-capture group
\n(?!^\d+/\d+/\d+ ).*    Match a newline and the rest of the line if it does not start with a date-like pattern
)*    Close the non-capture group and optionally repeat it
See a regex demo and a Python demo.
You can use re.findall to get all the matches:
import re
pattern = r"^\d+/\d+/\d+ .*(?:\n(?!^\d+/\d+/\d+ ).*)*"
s = ("05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC\n\n....")
print(re.findall(pattern, s, re.M))
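As a self-contained check, here is a minimal sketch that runs the same pattern over the sample text from the question. It assumes the records are separated by single newlines, as displayed in the question; if blank lines separate them, the blank line ends up at the end of each match, which is why every match is stripped. The names text and records are only illustrative.

```python
import re

# Sample text from the question (assumed newline-separated, as displayed).
text = """\
05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC
COM
Payable: 05/06/2021
QUALIFIED DIVIDENDS 23.50
ATVI - 0.00 23.50 (9,425.77)
05/13/21 05/13/21 Margin Div/Int - Income APPLE INC
COM
Payable: 05/13/2021
QUALIFIED DIVIDENDS 6.16
AAPL - 0.00 6.16 (9,419.61)
05/28/21 05/28/21 Margin Div/Int - Income STARBUCKS CORP
COM
Payable: 05/28/2021
QUALIFIED DIVIDENDS 18.00
SBUX - 0.00 18.00 (9,401.61)
05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE
Payable: 05/28/2021
- - 0.00 (73.03) (9,474.64)
"""

# A line starting with "date + space", followed by every line that does not.
pattern = r"^\d+/\d+/\d+ .*(?:\n(?!^\d+/\d+/\d+ ).*)*"

# re.M lets ^ match at the start of every line, not only at the start of the string.
# strip() removes the trailing newline or blank line that the last .* may pick up.
records = [m.strip() for m in re.findall(pattern, text, re.M)]

for record in records:
    print(record)
    print("---")
```

Note that "Payable: 05/06/2021" does not start a new record, because the pattern requires the date to be at the start of the line and followed by a space.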
Upvotes: 1
Reputation: 110665
You can match
.+?(?=\s*(?:\d{2}\/\d{2}\/\d{2} ){2}|$)
with 'g' ("global") and 's' ("single line" or "dot-all") flags set. 's' causes periods to match all characters, including line terminators.
The regular expression can be broken down as follows.
.+?                            # match one or more chars, lazily
(?=                            # begin a positive lookahead
  \s*                          # match zero or more whitespaces
  (?:                          # begin a non-capture group
    \d{2}\/\d{2}\/\d{2}[ ]     # match a date string followed by a space
  ){2}                         # end the non-capture group and execute it twice
  |                            # or
  $                            # match the end of the string
)                              # end positive lookahead
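In Python there is no 'g' flag; re.findall already returns every match, and re.S plays the role of the 's' flag. The following is a minimal sketch of how this pattern might be used from Python, on a shortened sample taken from the question. One thing to be aware of: the newline between two records can match on its own (it is one character followed directly by two dates), so the sketch drops whitespace-only matches. The forward slashes are left unescaped, since Python does not need them escaped. The variable names are only illustrative.

```python
import re

# Shortened sample from the question: two records, separated by a newline.
text = (
    "05/13/21 05/13/21 Margin Div/Int - Income APPLE INC\n"
    "COM\n"
    "Payable: 05/13/2021\n"
    "QUALIFIED DIVIDENDS 6.16\n"
    "AAPL - 0.00 6.16 (9,419.61)\n"
    "05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE\n"
    "Payable: 05/28/2021\n"
    "- - 0.00 (73.03) (9,474.64)"
)

# Lazily consume characters until two consecutive dates (or the end of the string) follow.
pattern = r".+?(?=\s*(?:\d{2}/\d{2}/\d{2} ){2}|$)"

# re.S (dot-all) lets "." match newlines as well.
matches = re.findall(pattern, text, re.S)

# Drop the whitespace-only matches that sit between records.
records = [m.strip() for m in matches if m.strip()]

for record in records:
    print(record)
    print("---")
```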
Upvotes: 1
Reputation: 2245
You can use this as base and make changes to get to the exact one you need:
\d+\/\d+\/\d+(.*?)\\n\\n(\s+\d+\/\d+\/\d+|$)
You can try it in the demo.
The changes I have made are these:
\n becomes \\n.
There is whitespace between \n\n and the dates in the sample text, so I have added \s+ to the regex.
The last \d of the leading date was missing its +, so I have added that.
Upvotes: 1