Reputation: 55
Sample text: 2019 2018 2017 ... 2015 2014 2013 total liabilities 1,455 1,633 stockowners' equity 2,100 2,599
Desired output:
full match: 2015 2014 2013 total liabilities 1,455 1,633 stockowners' equity 2,100 2,599
group 1 (years) = 2015 2014 2013
group 2 (target data) = stockowners' equity 2,100 2,599
I only need to match years of the form 20xx or 19xx. The number of figures after the target may be more or fewer than two, and each figure may or may not contain a comma or be preceded by a $ sign.
((?:20\d\d\s*|19\d\d\s*)+).*(stockowners'\s+equity\s+(?:\s*\$?\s*\d+,?\d+)+)
(note I have dotall flag ticked)
The problem with the current regex is that it picks up the first string of dates and then everything up to stockowners' equity. How do I pick up the last date sequence? I thought about reversing the string and searching backwards but it's a large text file and it takes too long.
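A minimal reproduction of the problem, assuming Python's `re` module with `re.DOTALL` standing in for the ticked dotall flag (the language is an assumption; it is not stated above):

```python
import re

# The regex from above, compiled with DOTALL as described.
ORIGINAL = re.compile(
    r"((?:20\d\d\s*|19\d\d\s*)+).*(stockowners'\s+equity\s+(?:\s*\$?\s*\d+,?\d+)+)",
    re.DOTALL,
)

sample = (
    "2019 2018 2017 ... 2015 2014 2013 total liabilities 1,455 1,633 "
    "stockowners' equity 2,100 2,599"
)

m = ORIGINAL.search(sample)
first_years = m.group(1).split()  # the FIRST run of years, not the last one
target = m.group(2)

print(first_years)
print(target)
```

Group 1 holds 2019 2018 2017 because the search starts at the leftmost year, and the greedy `.*` then swallows everything up to the target.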
Any help would be appreciated
Secondary Example: For the string: "2019 2018 assets $300 2017 2016 liabilities $100 equity $200" and target equity I want to pick up group 1: 2017 2016, group 2: equity $200; full match: 2017 2016 liabilities $100 equity $200
Elaboration: I am trying to use regex to pick up information from older SEC filings (mostly 10-Ks). These documents don't have enough HTML tags to make parsing with BeautifulSoup useful. I copied the text below from one such file. Suppose I want to pick up the data on investment securities. I want to get 101,017 and 91,339, but also the 2001 and 2000 at the top, so that I know which years the figures correspond to.
My problem is these documents are full of tables and they all start with years at the top (but not always the same years). I want to pick up the years from the table which has my target investment securities below.
At December 31 (In millions) 2001 2000
ASSETS Cash and equivalents $ 9,082 $ 8,195 Investment securities 101,017 91,339
Upvotes: 2
Views: 60
Reputation: 163297
You could match the last run of years by matching 19 or 20 followed by 2 digits at a word boundary, one or more times, and then asserting with a negative lookahead that no further year occurs to the right. The repeating years part is captured in group 1.
Then capture the stockowners part in group 2.
((?:\b(?:20|19)\d\d\s+)+)(?!.*(?:\b(?:20|19)\d\d\b)).*?\s+((?:stockowners'\s+)?\bequity\s+\$?\s*\d+(?:,\d+)?(?:\s+\$?\s*\d+(?:,\d+)?)*)$
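A quick way to check this pattern against both sample strings from the question is sketched below. Python's `re` module is an assumption here (the question does not name a language), and the names `PATTERN`, `samples` and `results` are only illustrative.

```python
import re

# The pattern above, split across lines for readability only.
PATTERN = re.compile(
    r"((?:\b(?:20|19)\d\d\s+)+)"       # group 1: a run of years
    r"(?!.*(?:\b(?:20|19)\d\d\b))"     # no further year to the right
    r".*?\s+"                          # anything in between, as few chars as possible
    r"((?:stockowners'\s+)?\bequity\s+\$?\s*\d+(?:,\d+)?(?:\s+\$?\s*\d+(?:,\d+)?)*)$"
)

samples = [
    "2019 2018 2017 ... 2015 2014 2013 total liabilities 1,455 1,633 "
    "stockowners' equity 2,100 2,599",
    "2019 2018 assets $300 2017 2016 liabilities $100 equity $200",
]

results = []
for text in samples:
    m = PATTERN.search(text)
    # group 1 keeps its trailing whitespace, so strip it for display
    results.append((m.group(0), m.group(1).strip(), m.group(2)))

for full, years, target in results:
    print(years, "|", target)
```

Two caveats when adapting this: without `re.DOTALL` neither the lookahead nor `.*?` crosses a newline, and the trailing `$` means the target data has to sit at the end of the string being searched.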
The pattern will match
(
Capture group 1
(?:\b(?:20|19)\d\d\s+)+
Repeat 1+ times matching either 20 or 19 followed by 2 digits and 1+ whitespace chars
)
Close group 1
(?!.*(?:\b(?:20|19)\d\d\b))
Negative lookahead, assert that on the right there is no more occurrence of the year pattern
.*?\s+
Match 0+ times any char except a newline, as few as possible, and then match 1+ whitespace chars
(
Capture group 2
(?:stockowners'\s+)?
Optionally match stockowners' and 1+ whitespace chars
\bequity\s+
Match equity and 1+ whitespace chars
\$?\s*\d+(?:,\d+)?
Match an optional $, optional whitespace chars, and 1+ digits with an optional comma-separated digit group
(?:\s+\$?\s*\d+(?:,\d+)?)*
Repeat 0+ times matching the same as previous part with 1+ whitespace chars prepended
)
Close group 2
$
End of string
Upvotes: 2