Evy555

Reputation: 229

Python Regex for Securities

I have a text file that contains security names, dollar amounts, and percentages of the portfolio. I'm trying to figure out how to separate the companies using regex. My original solution used .split('%') and then built the three variables I needed, but I discovered that some of the securities contain % in their names, so that solution was inadequate.

String example:

Pinterest, Inc. Series F, 8.00%$24,808,9320.022%ResMed,Inc.$23,495,3260.021%Eaton Corp. PLC$53,087,8430.047%

Current regex

[a-zA-Z0-9,$.\s]+[.0-9%]$

My current regex only finds the last company, i.e. Eaton Corp. PLC$53,087,8430.047%

Any help on how I can find every single instance of a company?

Solution desired

["Pinterest, Inc. Series F, 8.00%$24,808,9320.022%","ResMed,Inc.$23,495,3260.021%","Eaton Corp. PLC$53,087,8430.047%"]

Upvotes: 2

Views: 109

Answers (2)

linden2015

Reputation: 887

A working solution for Python, with named groups: https://regex101.com/r/sqkFaN/2

(?P<item>(?P<name>.*?)\$(?P<usd>[\d,\.]*?%))

At the link I provided you can see changes take effect in real time, and the sidebar explains the syntax used.
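For reference, here is a minimal sketch of how the pattern above could be used from Python with `re.finditer`. The variable names (`pattern`, `text`, `records`) are only illustrative, and the input is the sample string from the question.

```python
import re

# Pattern from this answer, unchanged.
pattern = re.compile(r'(?P<item>(?P<name>.*?)\$(?P<usd>[\d,\.]*?%))')

# Sample string from the question.
text = ('Pinterest, Inc. Series F, 8.00%$24,808,9320.022%'
        'ResMed,Inc.$23,495,3260.021%'
        'Eaton Corp. PLC$53,087,8430.047%')

# One dict per match, keyed by group name: 'item', 'name', 'usd'.
records = [m.groupdict() for m in pattern.finditer(text)]

for r in records:
    print(r['name'], '|', r['usd'])
```

Note that the `usd` group does not include the leading `$`, and it still holds the dollar amount and the percentage fused together (for example `24,808,9320.022%`), because the sample data has no separator between them.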

Upvotes: 1

cxw

Reputation: 17051

In Python 3:

import re
p = re.compile(r'[^$]+\$[^%]+%')
p.findall('Pinterest, Inc. Series F, 8.00%$24,808,9320.022%ResMed,Inc.$23,495,3260.021%Eaton Corp. PLC$53,087,8430.047%')

Result:

['Pinterest, Inc. Series F, 8.00%$24,808,9320.022%', 'ResMed,Inc.$23,495,3260.021%', 'Eaton Corp. PLC$53,087,8430.047%']

Your initial issue was that the $ anchor made the regex match only at the end of the string. However, simply removing the $ still splits Pinterest into two entries at the % after 8.00.
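Both behaviours can be reproduced with the regex from the question. This is just a small demonstration on the sample string; the names `anchored` and `unanchored` are mine.

```python
import re

# Sample string from the question.
text = ('Pinterest, Inc. Series F, 8.00%$24,808,9320.022%'
        'ResMed,Inc.$23,495,3260.021%'
        'Eaton Corp. PLC$53,087,8430.047%')

# Original regex: the trailing $ only lets a match end at the end of the string.
anchored = re.findall(r'[a-zA-Z0-9,$.\s]+[.0-9%]$', text)

# Same regex without the anchor: every % now ends an entry,
# including the one inside "8.00%".
unanchored = re.findall(r'[a-zA-Z0-9,$.\s]+[.0-9%]', text)

print(anchored)    # only the Eaton entry
print(unanchored)  # four pieces; Pinterest is split in two
```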

To fix that, the regex looks for a $, then a % after that, and takes everything up through the % as an entry. That pattern works for the examples you gave, but, of course, I can't know if it holds true for all your data.

Edit: The regex works like this:

r'               Use a raw string so you don't have to double the backslashes
  [^$]+          Look for anything up to the next $
       \$        Match the $ itself (\$ because $ alone means end-of-line)
         [^%]+   Now anything up to the next %
              %  And the % itself
               ' End of the string
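If you also need the three separate values (name, amount, percentage), the same idea can be extended with capture groups. The sketch below is not part of the pattern above; it rests on two assumptions about your data that I can't verify: dollar amounts always use comma-grouped thousands (so exactly three digits follow the last comma), and the percentage always contains a decimal point. Amounts under 1,000 would be ambiguous with this approach.

```python
import re

# Hypothetical extension of the pattern above; group names are illustrative.
entry = re.compile(r'''
    (?P<name>[^$]+)                  # security name: everything up to the next $
    \$                               # the literal dollar sign
    (?P<amount>\d{1,3}(?:,\d{3})*)   # ASSUMED: comma-grouped dollar amount
    (?P<pct>\d+\.\d+)%               # ASSUMED: percentage with a decimal point
''', re.VERBOSE)

# Sample string from the question.
text = ('Pinterest, Inc. Series F, 8.00%$24,808,9320.022%'
        'ResMed,Inc.$23,495,3260.021%'
        'Eaton Corp. PLC$53,087,8430.047%')

rows = [
    (m.group('name'),
     int(m.group('amount').replace(',', '')),  # '24,808,932' -> 24808932
     float(m.group('pct')))                    # '0.022' -> 0.022
    for m in entry.finditer(text)
]

for name, amount, pct in rows:
    print(name, amount, pct)
```

As with the original pattern, this works on the three sample entries, but check it against the rest of your file before relying on it.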

Upvotes: 3
