Reputation: 11
I'm trying to build a market analysis tool. The raw data input is formatted like this:
20,000 shares for 550 USD each
meaning "20,000 shares of stock at 550 USD per share".
Normally, I would grab the price with the following bit of code:
value = re.findall(re.compile('20,000 shares for (.*) USD each'), data)
However, this approach fails because the number of shares (in this case, 20,000) varies from line to line, as does the price. Is there a better way to extract this data?
I apologize in advance for the improper description of my problem; I'm a bit of a newbie to Python and I'm not sure about what technical terms to use in this scenario. If there is a better way to word my title, please feel free to edit, and thank you in advance!
Upvotes: 1
Views: 167
Reputation: 16286
You can use more general patterns such as:
([\d,.]+) shares for ([\d,.]+) USD each
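As a rough sketch of how that pattern might be used, the snippet below pulls both numbers out of each line and converts them. The sample data and the helper name parse_trades are made up for illustration, and it assumes commas are thousands separators and share counts are whole numbers.

```python
import re

# Hypothetical sample input; the real feed may look different.
data = "20,000 shares for 550 USD each\n1,500 shares for 12.75 USD each"

# Raw string so the backslash in \d reaches the regex engine untouched.
pattern = re.compile(r"([\d,.]+) shares for ([\d,.]+) USD each")


def parse_trades(text):
    """Return a list of (shares, price) tuples converted to numbers."""
    trades = []
    for shares, price in pattern.findall(text):
        # Assumption: ',' is a thousands separator and '.' the decimal point.
        trades.append((int(shares.replace(",", "")),
                       float(price.replace(",", ""))))
    return trades


trades = parse_trades(data)
print(trades)
```

With two capture groups, findall returns one tuple per match, so the share count and the price stay paired.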
Also, if you want to stick with .* for matching values, it's better to make it non-greedy by changing it to .*? so that it does not consume the rest of your input.
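To see the difference, here is a small comparison on a made-up line that contains two trades (the input string is an assumption, chosen only to expose the behaviour):

```python
import re

# Hypothetical line with two trades on it.
line = "20,000 shares for 550 USD each, 300 shares for 12 USD each"

# Greedy: .* runs to the LAST " USD each" on the line.
greedy = re.findall(r"shares for (.*) USD each", line)

# Non-greedy: .*? stops at the FIRST " USD each" after each match start.
lazy = re.findall(r"shares for (.*?) USD each", line)

print(greedy)
print(lazy)
```

The greedy version returns a single oversized match, while the non-greedy version returns each price separately.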
If the input can use either phrasing (for ... each, or of stock at ... per share), use the following instead:
([\d,.]+) shares(?: of stock)? (?:for|at) ([\d,.]+) USD (?:each|per share)
Putting ?: after the opening parenthesis makes it a non-capturing group: it still takes part in the match, but it is not returned along with the numbers you are interested in.
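A short sketch of what the ?: changes in practice. The two input lines are invented, and the (?:for|at) alternation is an assumption added here so that both phrasings match:

```python
import re

# Hypothetical inputs covering both phrasings.
lines = ("20,000 shares for 550 USD each\n"
         "300 shares of stock at 12.50 USD per share")

# The last group is a normal (capturing) group, so it shows up in the results.
capturing = re.findall(
    r"([\d,.]+) shares(?: of stock)? (?:for|at) ([\d,.]+) USD (each|per share)",
    lines,
)

# Same pattern, but the last group is non-capturing, so only the numbers return.
non_capturing = re.findall(
    r"([\d,.]+) shares(?: of stock)? (?:for|at) ([\d,.]+) USD (?:each|per share)",
    lines,
)

print(capturing)
print(non_capturing)
```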
Upvotes: 1
Reputation: 1414
Use a character class to specify the share numbers and the share price in your regular expression.
(\d[\d,.]*) shares for ([\d,.]+) USD each
Depending on what your data looks like, you may not need to be as careful about capturing separators. For example, if only whole shares are traded, you don't need the decimal point in the first digit group.
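For instance, if only whole shares are traded, the first group could drop the decimal point. This is only a sketch under that assumption; the sample string and variable names are hypothetical:

```python
import re

# Assumption: share counts are whole numbers, so the first group omits '.'.
whole_share_pattern = re.compile(r"(\d[\d,]*) shares for ([\d,.]+) USD each")

sample = "20,000 shares for 550.25 USD each"  # hypothetical input

match = whole_share_pattern.search(sample)
# Strip thousands separators before converting to numbers.
shares = int(match.group(1).replace(",", ""))
price = float(match.group(2).replace(",", ""))

print(shares, price)
```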
If you might use the same regex on more than one dataset, it behooves you to compile it separately from using it in the findall.
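import re
# Compile once; the raw string keeps \d from being read as a string escape.
compiled_regex = re.compile(r"(\d[\d,.]*) shares for ([\d,.]+) USD each")
# Reuse the same compiled pattern on each dataset.
trades1 = compiled_regex.findall(data1)
trades2 = compiled_regex.findall(data2)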
Upvotes: 0