Reputation: 845
I am trying to convert written numbers to numeric values.
For example, to extract millions from this string:
text = 'I need $ 150000000, or 150 million,1 millions, 15 Million, 15million, 15Million, 15 m, 15 M, 15m, 15M, 15 MM, 15MM, 5 thousand'
To:
'I need $ 150000000, or 150000000,1000000, 15000000, 15000000, 15000000, 15000000, 15000000, 15000000, 15000000, 15000000, 15000000, 5 thousand'
I use this function to remove any separators in the numbers first:
import re

def foldNumbers(text):
    """Remove "," or "." between digits."""
    text = re.sub(r'(?<=[0-9]),(?=[0-9])', "", text)   # remove commas
    text = re.sub(r'(?<=[0-9])\.(?=[0-9])', "", text)  # remove points
    return text
And I have written this regex to find all of the possible patterns for common million notations. It 1) finds the digits, 2) does a lookahead for the common notations for millions, and 3) uses "[a-z]?" to handle the optional "s" on million or millions (I have already removed "'").
re.findall(r'(?:[\d\.]+)(?= million[a-z]?|million[a-z]?| Million[a-z]?|Million[a-z]?|m| m|M| M|MM| MM)',text)
which correctly matches the million numbers and returns:
['150', '1', '15', '15', '15', '15', '15', '15', '15', '15', '15']
What I need to do now is to write a replacement pattern to insert "000000" after the digits, or to iterate through the matches and multiply the digits by 1000000. I have tried this so far:
re.sub(r'(?:[\d\.]+)(?= million[a-z]?|million[a-z]?| Million[a-z]?|Million[a-z]?|m| m|M| M|MM| MM)', "000000 ", text)
which returns:
'I need $ 150,000,000, or 000000 million,000000 millions, 000000 Million, 000000 million, 000000 Million, 000000 m, 000000 M, 000000 m, 000000 M, 000000 MM, 000000 MM, 5 thousand'
I think I need to use a lookbehind (?<=), but I haven't worked with one before, and after several attempts I can't seem to work it out.
FYI: My plan is to tackle "Millions" first and then to replicate the solution for Thousands (K), Billions (B), Trillions (T) and possibly for other units such as distances, currencies etc. I have searched SO and Google for solutions in NLP, text cleaning and text mining articles but did not find anything.
Upvotes: 1
Views: 855
Reputation: 371019
You can accomplish this with a relatively simple re.sub: match
(?i)\b(\d+) ?m(?:m|illions?)?\b
capturing the initial digits in a group, and replace with that group concatenated with 6 zeros:
r'\g<1>000000'
https://regex101.com/r/IedRP4/1
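To make the pieces of the pattern above easier to read, here is a sketch of the same regex written with re.VERBOSE and one comment per part. The name MILLIONS and the sample strings are only illustrative and not part of the original answer.

```python
import re

# Same pattern as above, split into commented parts (MILLIONS is a made-up name).
MILLIONS = re.compile(
    r"""
    \b                 # the number must start at a word boundary
    (\d+)              # group 1: the digits to keep
    [ ]?               # an optional single space
    m                  # 'm' or 'M' (case-insensitive)
    (?:m|illions?)?    # optionally a second 'm' (MM), or 'illion' / 'illions'
    \b                 # the suffix must end here, so '15 meters' is not matched
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Illustrative samples: the last two should not match.
samples = ["150 million", "15MM", "15m", "15 meters", "5 thousand"]
matched = [s for s in samples if MILLIONS.search(s)]
print(matched)
```

The trailing \b is what keeps a bare "m" from matching the start of a longer word.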
Code:
import re
text = 'I need $ 150000000, or 150 million,1 millions, 15 Million, 15million, 15Million, 15 m, 15 M, 15m, 15M, 15 MM, 15MM, 5 thousand'
output = re.sub(r'(?i)\b(\d+) ?m(?:m|illions?)?\b', r'\g<1>000000', text)
print(output)
(because the group in the replacement is followed by digits, make sure to use \g<#> syntax rather than \# syntax)
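If you later extend this to thousands, billions and trillions as planned in the question, one option is to pass a function as the replacement instead of a template string. That sidesteps the \g<#> issue entirely, because the number is multiplied rather than padded with zeros. The following is only a sketch: the MULTIPLIERS table, the suffix spellings it accepts, and the name expand_units are all assumptions, not something taken from the question.

```python
import re

# Assumed suffix table; adjust the spellings to whatever your data contains.
MULTIPLIERS = {
    "k": 10**3, "thousand": 10**3,
    "m": 10**6, "mm": 10**6, "million": 10**6,
    "b": 10**9, "billion": 10**9,
    "t": 10**12, "trillion": 10**12,
}

# Group 1: digits. Group 2: a spelled-out unit (optional plural 's') or an abbreviation.
UNIT_RE = re.compile(
    r"\b(\d+) ?((?:thousand|million|billion|trillion)s?|mm|[kmbt])\b",
    re.IGNORECASE,
)

def expand_units(text):
    """Replace '<digits><unit>' with the multiplied-out integer (hypothetical helper)."""
    def repl(match):
        unit = match.group(2).lower().rstrip("s")  # 'Millions' -> 'million'
        return str(int(match.group(1)) * MULTIPLIERS[unit])
    return UNIT_RE.sub(repl, text)

print(expand_units("15 MM, 5 thousand, 2k, 3 billions"))
```

Note that single-letter suffixes are ambiguous in free text ("5 m" could mean metres, "15 mm" millimetres), so you may want to restrict them to contexts where a currency symbol or similar cue is present.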
Upvotes: 1