Reputation: 33

Filtering a number from different formats with Regex

I am trying to do some data analysis and there are some numbers that I want to analyze, the problem being that those numbers are in different string formats. These are the different formats:

"25,000,000 USD" or "9 500 USD" or "50,000 ETH"

It is basically always a number first, separated by either commas or blank spaces followed by the currency. Depending on the currency, i want to calculate the amount in USD afterwards.

I have looked up Regex expressions for the last hour and could not find anything that solves my problem. I definitely made some progress and implemented different expressions, but none worked 100%. It's always missing something as you will see below.

for i, row_value in df2['hardcap'].iteritems():
    try:

        q = df2['hardcap'][i]
        c = re.findall(r'[a-zA-Z]+', q)
        if c[0] == "USD":
            d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q)
            #Do something with the number
        elif c[0] == "EUR":
            d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q)
            #Do something with the number
        elif c[0] == "ETH": 
            d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q)
            #Do something with the number
        print(d[0])   

    except Exception:
        pass

So I am iterating through my dataframe column and first, ill find out which currency the number is related to, either "USD", "EUR" or "ETH" which I save in c. This part already works, after that, i want to extract the number in a form that can be converted to an integer so I can do calculations with it.

Right now, the line d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q) returns something like this in d[0]: ('100,000,000', ',000') if the number was 100,000,000 and ('270', '') if the number was 270 000 000

What I would like to get in the best case would be something like: 100000000 and 270000000, but any way to extract the whole numbers would suffice

I'd appreciate any bump in the right direction as I don't have much experience with regex and feel stuck right now.

Upvotes: 0

Answers (3)

Booboo

Reputation: 44098

import re

s = '25,000,000 USD 9 500 USD 50,000 ETH'
matches = re.findall(r'(\d[\d, ]*) ([A-Z]{3})', s)
l = [(int(match[0].replace(',', '').replace(' ', '')), match[1]) for match in matches]
print(l)

[(25000000, 'USD'), (9500, 'USD'), (50000, 'ETH')]

Upvotes: 0

RomanPerekhrest

Reputation: 92854

Optimized solution with re.search + re.sub functions:

import re

# equivalent for your df2['hardcap'] column values
hardcap = ["25,000,000 USD", "9 500 USD", "50,000 ETH"]

pat = re.compile(r'^(\d[\s,\d]*\d) ([A-Z]{3})')
for v in hardcap:
    m = pat.search(v)
    if m:    # if value is in the needed format
        amount, currency = m.group(1), m.group(2)
        amount = int(re.sub(r'\D*', '', amount))
        print(amount, currency)

Sample output:

25000000 USD
9500 USD
50000 ETH

Upvotes: 0

Andrej Kesely

Reputation: 195418

import re

s = '25,000,000 USD 9 500 USD 50,000 ETH'

for g in re.findall(r'(.*?)([A-Z]{3})', s):
    print(int(''.join(re.findall(r'\d', g[0]))), g[1])

Prints:

25000000 USD
9500 USD
50000 ETH

Upvotes: 1

Filtering a number from different formats with Regex

Answers (3)

Related Questions