user3151858

Reputation: 820

How do I make a regex non-greedy to extract a specific element?

I have the following text from which I need to extract certain phrases:

Restricted Cash 951 37505 Accounts Receivable - Affiliate 31613 27539 Accounts
 Receivable - Third Party 23091 2641 Crude Oil Inventory 2200 0 Other Current
 Assets 2724 389 
Total Current Assets 71319 86100 Property Plant and Equipment Total Property 
Plant and Equipment Gross 1500609 706039 Less Accumulated 
Depreciation and Amortization (79357) (44271) Total Property Plant and Equipment
 Net 1421252 661768 Intangible Assets Net 310202 0 Goodwill 109734 0 Investments
 82317 80461 Other Noncurrent Assets 3093 1429 Total Assets 1997917 829758 
LIABILITIES Current Liabilities Accounts Payable - Affiliate 2778 1616 Accounts
 Payable - Trade 92756 109893 Other Current Liabilities 9217 2876 Total Current
 Liabilities 104751 114385 Long-Term Liabilities Long-Term Debt 559021 85000
 Asset Retirement Obligations 17330 10416 Other Long-Term Liabilities 582 3727 
Total Liabilities 681684 213528 EQUITY Partners' Equity Limited Partner 
Common Units (23759 and 23712 units outstanding respectively) 699866 642616
 Subordinated Units (15903 units outstanding) (130207) (168136) General Partner 2421 520 
Total Partners' Equity 572080 475000 Noncontrolling Interests 744153 141230 Total 
Equity 1316233 616230 Total Liabilities and Equity 1997917 829758

I need to remove every phrase that is enclosed in parentheses, i.e. ( ), and that contains a number together with the word outstanding or units.

Based on these conditions, there are two phrases that need to be removed:

  1. (23759 and 23712 units outstanding respectively)
  2. (15903 units outstanding)

I have tried the following Regex in Python:

\(\d+.+?(outstanding)+?\)

The idea was that .+? after \d+ would make the regex non-greedy (lazy). However, the regex selects a huge segment starting at (79357) (44271) Total Property Plant and Equipment and running all the way to outstanding), which looks greedy.

The unique marker here is the word outstanding. Maybe there is a better approach to extracting those phrases?
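A minimal reproduction of the behaviour, using a shortened sample of the text above (the variable name s and the trimmed sample are only for illustration):

```python
import re

# Shortened sample of the text in the question; `s` is a hypothetical name.
s = ("Depreciation and Amortization (79357) (44271) Total Property Plant and "
     "Equipment Net 1421252 661768 Common Units (23759 and 23712 units "
     "outstanding respectively) 699866 642616 Subordinated Units (15903 units "
     "outstanding) (130207) (168136)")

# The pattern from the question: the match begins at the first "(" followed
# by digits, which is "(79357)", not at the phrase that should be removed.
m = re.search(r'\(\d+.+?(outstanding)+?\)', s)
print(m.group(0))
```

The printed match begins at (79357) and only stops at the first outstanding that is immediately followed by a closing parenthesis.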

Upvotes: 1

Views: 28

Answers (1)

Wiktor Stribiżew

Reputation: 626870

You may use

\(\d[^()]*outstanding[^()]*\)

Details

  • \( - a ( char
  • \d - a digit
  • [^()]* - 0+ chars other than ( and )
  • outstanding - a substring
  • [^()]* - 0+ chars other than ( and )
  • \) - a ) char.
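The breakdown above can also be written directly into the pattern with re.VERBOSE, which lets each part carry its own comment. This is only a sketch: the names PATTERN, sample and found are illustrative, and the sample string is a short excerpt of the text in the question.

```python
import re

# Same pattern as above, spread over several lines with inline comments.
# In verbose mode, whitespace outside character classes is ignored.
PATTERN = re.compile(r"""
    \(            # a literal ( char
    \d            # a digit
    [^()]*        # zero or more chars other than ( and )
    outstanding   # the literal substring
    [^()]*        # zero or more chars other than ( and )
    \)            # a literal ) char
""", re.VERBOSE)

sample = "Units (15903 units outstanding) (130207) (168136) General Partner"
found = PATTERN.findall(sample)
print(found)
```

Groups such as (130207) are skipped because they do not contain the word outstanding before their closing parenthesis.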

Python:

import re
re.findall(r'\(\d[^()]*outstanding[^()]*\)', s)  # s holds the input text
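Since the question asks to remove the phrases rather than just find them, here is a small sketch using re.sub. A lazy quantifier only limits how far a match extends to the right; the search still starts at the leftmost possible (, and . is free to cross other parentheses. The negated class [^()] cannot cross them, so each match stays inside one pair. The sample string, the variable names and the trailing \s* (added here to swallow the whitespace left behind) are assumptions for illustration.

```python
import re

# Shortened sample of the text in the question (hypothetical variable `s`).
s = ("Depreciation and Amortization (79357) (44271) "
     "Common Units (23759 and 23712 units outstanding respectively) 699866 642616 "
     "Subordinated Units (15903 units outstanding) (130207) (168136)")

pattern = r'\(\d[^()]*outstanding[^()]*\)'

# Extract the phrases: only groups containing "outstanding" are returned.
phrases = re.findall(pattern, s)
print(phrases)

# Remove the phrases; \s* also drops the whitespace that follows each one.
cleaned = re.sub(pattern + r'\s*', '', s)
print(cleaned)
```

If the surrounding spacing has to be preserved exactly, drop the \s* and normalise whitespace in a separate step instead.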

Upvotes: 1
