Reputation: 11

How can I use python regex to extract data with variable content?

I'm trying to build a market analysis tool. The raw data input is formatted like this:

20,000 shares for 550 USD each

meaning "20,000 shares of stock at 550 USD per share".

Normally, I would grab the price with the following bit of code:

value = re.findall(re.compile('20,000 shares for (.*) USD each'), data)

However, this approach fails me as the number of shares (in this case, 20 thousand) changes as well as the price value. Is there a better way to extract this data?

I apologize in advance for the improper description of my problem; I'm a bit of a newbie to Python and I'm not sure about what technical terms to use in this scenario. If there is a better way to word my title, please feel free to edit, and thank you in advance!

Upvotes: 1

Answers (2)

Sina Iravanian

Reputation: 16286

You can use more general patterns such as:

([\d,.]+) shares for ([\d,.]+) USD each

Also, if you want to stick with .* for matching values, make it non-greedy by turning it into .*? so that it stops at the first possible match instead of consuming the rest of your input.
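To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample string with two trades on one line is an assumption made up for illustration; it is not from the question's real data.

```python
import re

# Hypothetical sample: two trades on the same line.
data = "20,000 shares for 550 USD each, 1,500 shares for 72.50 USD each"

# Greedy: .* runs as far as it can, up to the LAST " USD each".
greedy = re.findall(r"shares for (.*) USD each", data)

# Non-greedy: .*? stops at the FIRST " USD each" after each match start.
lazy = re.findall(r"shares for (.*?) USD each", data)

print(greedy)
print(lazy)
```

With the greedy pattern the whole line collapses into a single oversized match, while the non-greedy one yields one price per trade.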

If the input can end in either "each" or "per share", use the following instead:

([\d,.]+) shares(?: of stock)? (?:for|at) ([\d,.]+) USD (?:each|per share)

Putting ?: after the opening parenthesis makes it a non-capturing group: it still takes part in the match, but it is not returned along with the numbers you are interested in.
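As a rough illustration of what non-capturing groups do to the result, the sketch below runs a pattern of this shape over two sample lines. The second sample line and the (?:for|at) alternation are assumptions added here so that both wordings from the question match.

```python
import re

# Only the two number groups capture; the (?:...) groups just match.
# (?:for|at) is an assumption so both wordings are accepted.
pattern = re.compile(
    r"([\d,.]+) shares(?: of stock)? (?:for|at) ([\d,.]+) USD (?:each|per share)"
)

# Hypothetical input covering both phrasings.
data = (
    "20,000 shares for 550 USD each\n"
    "1,500 shares of stock at 72.50 USD per share\n"
)

# With two capturing groups, findall returns a list of 2-tuples.
trades = pattern.findall(data)
print(trades)
```

Because the optional and alternation groups are non-capturing, each result tuple holds just the share count and the price, as strings.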

Upvotes: 1

pcurry

Reputation: 1414

Use a character class to specify the share numbers and the share price in your regular expression.

(\d[\d,.]*) shares for ([\d,.]+) USD each

Depending on what your data looks like, you may not need to be as careful about capturing separators. For example, if only whole shares are traded, you don't need the decimal point in the first digit group.
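If only whole shares are traded, a tighter variant might look like the sketch below. The stricter price group, the helper name parse_trades, and the sample text are all assumptions for illustration; adjust them to whatever your real data contains.

```python
import re
from decimal import Decimal

# Share count: digits and thousands separators only (no decimal point).
# Price: digits, optional thousands separators, optional decimal part.
pattern = re.compile(r"(\d[\d,]*) shares for (\d[\d,]*(?:\.\d+)?) USD each")

def parse_trades(text):
    """Hypothetical helper: return (share_count, price) as (int, Decimal)."""
    return [
        (int(shares.replace(",", "")), Decimal(price.replace(",", "")))
        for shares, price in pattern.findall(text)
    ]

# Made-up sample input.
trades = parse_trades(
    "20,000 shares for 550 USD each; 300 shares for 1,072.50 USD each"
)
print(trades)
```

Stripping the commas before conversion gives you numbers you can do arithmetic on, and Decimal avoids floating-point rounding on prices.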

If you might use the same regex on more than one dataset, compile it once up front and reuse the compiled pattern for each findall call.

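import re
# Raw string so that \d reaches the regex engine unchanged.
compiled_regex = re.compile(r"(\d[\d,.]*) shares for ([\d,.]+) USD each")

# Reuse the same compiled pattern on each dataset.
trades1 = compiled_regex.findall(data1)
trades2 = compiled_regex.findall(data2)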

Upvotes: 0
