Reputation: 33
I am trying to do some data analysis and there are some numbers that I want to analyze, the problem being that those numbers are in different string formats. These are the different formats:
"25,000,000 USD" or "9 500 USD" or "50,000 ETH"
It is basically always a number first, separated by either commas or blank spaces followed by the currency. Depending on the currency, i want to calculate the amount in USD afterwards.
I have looked up Regex expressions for the last hour and could not find anything that solves my problem. I definitely made some progress and implemented different expressions, but none worked 100%. It's always missing something as you will see below.
for i, row_value in df2['hardcap'].iteritems():
try:
q = df2['hardcap'][i]
c = re.findall(r'[a-zA-Z]+', q)
if c[0] == "USD":
d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q)
#Do something with the number
elif c[0] == "EUR":
d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q)
#Do something with the number
elif c[0] == "ETH":
d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q)
#Do something with the number
print(d[0])
except Exception:
pass
So I am iterating through my dataframe column and first, ill find out which currency the number is related to, either "USD", "EUR" or "ETH" which I save in c. This part already works, after that, i want to extract the number in a form that can be converted to an integer so I can do calculations with it.
Right now, the line d = re.findall(r'^(\d?\d?\d(,\d\d\d)*|\d)', q) returns something like this in d[0]: ('100,000,000', ',000') if the number was 100,000,000 and ('270', '') if the number was 270 000 000
What I would like to get in the best case would be something like: 100000000 and 270000000, but any way to extract the whole numbers would suffice
I'd appreciate any bump in the right direction as I don't have much experience with regex and feel stuck right now.
Upvotes: 0
Views: 57
Reputation: 44098
import re
s = '25,000,000 USD 9 500 USD 50,000 ETH'
matches = re.findall(r'(\d[\d, ]*) ([A-Z]{3})', s)
l = [(int(match[0].replace(',', '').replace(' ', '')), match[1]) for match in matches]
print(l)
[(25000000, 'USD'), (9500, 'USD'), (50000, 'ETH')]
Upvotes: 0
Reputation: 92854
Optimized solution with re.search
+ re.sub
functions:
import re
# equivalent for your df2['hardcap'] column values
hardcap = ["25,000,000 USD", "9 500 USD", "50,000 ETH"]
pat = re.compile(r'^(\d[\s,\d]*\d) ([A-Z]{3})')
for v in hardcap:
m = pat.search(v)
if m: # if value is in the needed format
amount, currency = m.group(1), m.group(2)
amount = int(re.sub(r'\D*', '', amount))
print(amount, currency)
Sample output:
25000000 USD
9500 USD
50000 ETH
Upvotes: 0
Reputation: 195418
import re
s = '25,000,000 USD 9 500 USD 50,000 ETH'
for g in re.findall(r'(.*?)([A-Z]{3})', s):
print(int(''.join(re.findall(r'\d', g[0]))), g[1])
Prints:
25000000 USD
9500 USD
50000 ETH
Upvotes: 1