Homunculus Reticulli

Reputation: 68466

Python regex to parse financial data

I am relatively new to regex (always struggled with it for some reason)...

I have text that is of this form:

David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...

Mark Brookes, Non Executive Director bought 811 shares in the company on  YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...

Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...

Parsing the text reveals the following structure:

  1. Two or more words beginning the sentence, and before the first comma, is the name of the person involved in the transaction
  2. One or more words before ('sold'|'bought'|'exercised'|'sold post-exercise') is the title of the person
  3. Presence of either one of these: ('sold'|'bought'|'exercised'|'sold post-exercise') AFTER the title, identifies the transaction type
  4. The first numeric string following the transaction type ('sold'|'bought'|'exercised'|'sold post-exercise') denotes the size of the transaction
  5. 'price of ' precedes a numeric string, which specifies the price at which the deal was struck.

My question is:

How can I use this knowledge (and regex), to write a function that parses similar text to return the variables of interest (listed 1 - 5 above)?

Pseudo code for the function I want to write ..

def grok_directors_dealings_text(text_input):
    name, title, transaction_type, lot_size, price = (None, None, None, None, None)
    ....
    name = ...
    title = ...
    transaction_type = ...
    lot_size = ...
    price = ...

    pass

How would I use regex to implement the functions to return the variables of interest when passed in text that conforms to the structure I have identified above?

[[Edit]]

For some reason, I have struggled with regex for a while. If I am to learn from the correct answer here on S.O, it would be much better if an explanation is offered as to why the magical expression (sorry, regexpr) actually works.

I want to actually learn this stuff instead of copy pasting expressions ...

Upvotes: 0

Views: 693

Answers (4)

Pedro Pinheiro

Reputation: 1069

I came up with this regex:

([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p

Basically, we are using parentheses to capture the important info you want, so let's check each group:

  • ([\w ]+): \w matches any word character [a-zA-Z0-9_], and the class also allows a space; + repeats it one or more times. This gives us the name of the person;
  • ([\w ]+): another one of these, after a comma and a space, to get the title;
  • (sold post-exercise|sold|bought|exercised): then we search for our transaction types. Notice I put sold post-exercise before sold so that it tries to match the longer alternative first;
  • ([\d,\.]+): then we try to find the number, which is made of digits (\d) and commas, and probably a dot may appear as well;
  • ([\d\.,]+): then we need to get the price, which is captured with the same kind of character class as the size of the transaction.

The parts of the regex that connect the groups are pretty basic as well: a comma and a space, single spaces, and .*price of in front of the price.
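To see why the order of the alternatives matters, here is a minimal sketch that looks at the alternation on its own. The sample string is cut down from the question's first line. In a longer pattern the engine can sometimes backtrack into a later alternative, so treat this as an illustration of the left-to-right rule rather than a claim about every pattern.

```python
import re

text = "David Meredith, Financial Director sold post-exercise 15,000 shares"

# Longer alternative listed first: the whole phrase is captured.
longest_first = re.search(r"(sold post-exercise|sold|bought|exercised)", text)

# Shorter alternative listed first: the engine accepts "sold" and stops.
shortest_first = re.search(r"(sold|sold post-exercise|bought|exercised)", text)

print(longest_first.group(1))   # sold post-exercise
print(shortest_first.group(1))  # sold
```

Alternatives are tried left to right and the first one that succeeds wins, so listing the longer phrase first is the safer habit.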

If you try it on regex101 it provides some explanation about the regex and generates this code in python to use:

import re
# Python 3: use a plain raw string (the ur'...' prefix only exists in Python 2)
p = re.compile(r'([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p')

test_str = u"David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...\n\nMark Brookes, Non Executive Director bought 811 shares in the company on  YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...\n\nAlbert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by..."

re.findall(p, test_str)
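For the function in the question, one possible way to wrap this pattern is sketched below. It is the same regex split over several string pieces with named groups. The group names and the choice to return None on a failed match are my assumptions, not something the question specifies.

```python
import re

# Same pattern as above, with named groups (the names are illustrative).
DEALING_RE = re.compile(
    r"(?P<name>[\w ]+), "        # name: everything before the first comma
    r"(?P<title>[\w ]+) "        # title: words before the transaction type
    r"(?P<transaction_type>sold post-exercise|sold|bought|exercised) "
    r"(?P<lot_size>[\d,\.]+)"    # size of the transaction, e.g. 15,000
    r".*price of (?P<price>[\d\.,]+)p"  # price, e.g. 1044.00
)

def grok_directors_dealings_text(text_input):
    """Return (name, title, transaction_type, lot_size, price), or None."""
    m = DEALING_RE.search(text_input)
    if m is None:
        # Assumption: signal "text does not fit the structure" with None.
        return None
    return (m.group("name"), m.group("title"), m.group("transaction_type"),
            m.group("lot_size"), m.group("price"))

line = ("Mark Brookes, Non Executive Director bought 811 shares in the "
        "company on  YYYY-mm-dd at a price of 76.75p. The Director now holds")
print(grok_directors_dealings_text(line))
# ('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75')
```

All five values come back as strings; the function does no numeric conversion.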

Upvotes: 1

DorElias

Reputation: 2313

this is the regex that will do it

(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d|,]+).*?price of ([\d|\.]+)

you use it like this

import re
def get_data(line):
    pattern = r"(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d|,]+).*?price of ([\d|\.]+)"
    m = re.match(pattern, line)
    # re.match returns None when the line does not fit the pattern
    return m.groups() if m else None

for the first line this will return

('David Meredith', ' Financial Director ', 'sold post-exercise', '15,000', '1044.00')

EDIT: adding explanation

This regex works as follows. The first part, (.*?), means: capture the string up to the next thing to match (which is the ,).

. means any character (except a newline)

* means zero or more times (many characters, not just one)

? after * means don't be greedy: the match stops at the first ',' rather than a later one (if there are several)

After that comes (.*?) again: capture the characters up to the next thing to match (which is one of the fixed words).

After that comes (sold post-exercise|sold|bought|exercised), which means: find one of these words (separated by |).

After that comes .*?, which again means take all text up to the next match (this time it is not surrounded by (), so it is not captured as a group and is not part of the output).

([\d|,]+) means take a digit (\d) or a comma, one or more times (+). Note that | has no special meaning inside [], so this class also accepts a literal | character; [\d,]+ would be enough.

Then .*? again, as before.

'price of ' matches the literal string 'price of '.

Finally, ([\d|\.]+) means take a digit or a dot (escaped because . normally means 'any character'), one or more times.
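A small sketch of two points from the explanation: lazy versus greedy matching, and how | behaves inside square brackets. The sample strings are made up for the demonstration.

```python
import re

# Hypothetical sample with two commas, to show where each match stops.
s = "Albert Ellis, CEO bought 262 shares, then more"

lazy = re.match(r"(.*?),", s)    # stops at the first comma
greedy = re.match(r"(.*),", s)   # runs on to the last comma

print(lazy.group(1))    # Albert Ellis
print(greedy.group(1))  # Albert Ellis, CEO bought 262 shares

# Inside [] the | is an ordinary character, not "or".
with_pipe = re.findall(r"[\d|,]+", "15,000|7")
without_pipe = re.findall(r"[\d,]+", "15,000|7")

print(with_pipe)     # ['15,000|7']
print(without_pipe)  # ['15,000', '7']
```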

Upvotes: 0

TigerhawkT3

Reputation: 49320

You can use the following regex that just looks for characters surrounding the delimiters:

(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p

The parts in parentheses will be captured as groups.

>>> import re
>>> l = ['''David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...''', '''Mark Brookes, Non Executive Director bought 811 shares in the company on  YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...''', '''Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...''']
>>> for s in l:
...     print(re.findall(r'(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p', s))
...
[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00')]
[('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75')]
[('Albert Ellis', 'CEO', 'bought', '262', '52.00')]
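The captured size and price are still strings. If numbers are needed, a conversion along these lines might do. The helper name to_numbers is hypothetical, and using Decimal for the price (presumably pence, given the trailing p) is my assumption rather than anything the question asks for.

```python
from decimal import Decimal

def to_numbers(lot_size, price):
    """Convert captured strings such as '15,000' and '1044.00' to numbers."""
    shares = int(lot_size.replace(",", ""))  # drop the thousands separators
    pence = Decimal(price)                   # keeps the value exact, unlike float
    return shares, pence

print(to_numbers("15,000", "1044.00"))  # (15000, Decimal('1044.00'))
```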

Upvotes: 0

Pruthvi Raj

Reputation: 3036

You can use the following regex:

(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)

DEMO

Python:

import re

financialData = """
David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...

Mark Brookes, Non Executive Director bought 811 shares in the company on  YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...

Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...
"""

print(re.findall(r'(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)',financialData))

Output:

[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00p'), ('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75p'), ('Albert Ellis', 'CEO', 'bought', '262', '52.00p')]
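One detail worth knowing about the price group: the . in (\d*.\d+?p) is not escaped, so it matches any character, not only a decimal point. A small sketch of the difference follows; the malformed sample string is invented for the demonstration.

```python
import re

unescaped = r"price of\s*(\d*.\d+?p)"   # price part as used above
escaped = r"price of\s*(\d*\.\d+?p)"    # variant with a literal dot

good = "at a price of 76.75p."
odd = "at a price of 76x75p."           # hypothetical malformed price

print(re.findall(unescaped, good))  # ['76.75p']
print(re.findall(escaped, good))    # ['76.75p']
print(re.findall(unescaped, odd))   # ['76x75p']
print(re.findall(escaped, odd))     # []
```

On the sample data both forms give the same result; the escaped form is just stricter about what counts as a price.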

EDIT 1

To understand what each part means, follow the DEMO link: the block at the top right explains each and every character of the expression.

Also Debuggex helps you simulate the string by showing what group matches which characters!

Here's a debuggex demo for your particular case:

(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)

Upvotes: 2
