jamest

Reputation: 43

Why isn't my regex matching the numerical strings below?

If I use the money_conversion function on $17 million, it returns 17000000, and so on. Only when the number is a single digit does it return an incorrect result, e.g. $7 million converts to 7 instead of 7000000.

import re

number = r'\d+(,\d{3})*\.*\d*'                     #$790,000
amount = r'thousand|million|billion'                #$12.2 million example

word_re = rf'\${number}(-|\sto\s|–)?(\$*{number})\s?({amount})'
value_re = rf'\${number}'

def parse_word_syntax(string):
    value_string = re.search(number,string).group()
    value = float(value_string.replace(',',''))
    word = re.search(amount,string,flags=re.I).group().lower()
    word_value = word_to_value(word)
    return value * word_value

def word_to_value(word):
    value_dict ={'thousand':1000,'million':1000000,'billion':1000000000}
    return value_dict[word]

def parse_value_syntax(string):
    value_string = re.search(number,string).group()
    value = float(value_string.replace(',',''))
    return value

def money_conversion(money):
    if money == 'N/A':
        return None
    
    if isinstance(money,list):
        money = money[0]
        
    word_syntax = re.search(word_re,money,flags=re.I)
    value_syntax = re.search(value_re,money)
    
    if word_syntax:
        print('converting word object to numerics')
        return parse_word_syntax(word_syntax.group())
    
    elif value_syntax:
        print('converting float objects to numerics')
        return parse_value_syntax(value_syntax.group())
    
    else:
        return None

Upvotes: 1

Views: 54

Answers (1)

Wiktor Stribiżew

Reputation: 626927

The reason is quite simple: your string does not match the word_re regex, which expands to \$\d+(,\d{3})*\.*\d*(-|\sto\s|–)?(\$*\d+(,\d{3})*\.*\d*)\s?(thousand|million|billion) (see its demo). You tried to make each subsequent pattern part optional, but forgot that the \d+ at the start of the number block requires at least one digit. Since word_re contains two occurrences of number, the resulting regex requires at least two digits, so $7 million fails word_re and falls through to value_re, which matches only $7.
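To see the failure in isolation, here is a minimal sketch that reuses the patterns from the question unchanged and tries them on both inputs. The variable names two_digits and one_digit are only for illustration.

```python
import re

# Patterns copied from the question, unchanged.
number = r'\d+(,\d{3})*\.*\d*'
amount = r'thousand|million|billion'
word_re = rf'\${number}(-|\sto\s|–)?(\$*{number})\s?({amount})'

two_digits = re.search(word_re, '$17 million', flags=re.I)
one_digit = re.search(word_re, '$7 million', flags=re.I)

# '$17 million' matches only because the first \d+ backtracks to '1',
# leaving '7' for the second, mandatory \d+ (capture group 3).
print(two_digits.group(), two_digits.group(3))

# '$7 million' has a single digit, so the second \d+ has nothing to match.
print(one_digit)
```

The second search returns None, which is why money_conversion takes the value_re branch for $7 million.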

You need to use

number = r'\d+(?:,\d{3})*(?:\.\d+)?'
word_re = rf'\${number}(?:(?:-|\sto\s|–)\${number})?\s*({amount})'

See the Python demo.

  • \$\d+(?:,\d{3})*(?:\.\d+)? - matches $, one or more digits, then zero or more repetitions of a comma followed by three digits, and then an optional . followed by one or more digits
  • (?:(?:-|\sto\s|–)\$\d+(?:,\d{3})*(?:\.\d+)?)? - an optional sequence of:
    • (?:-|\sto\s|–) - -, whitespace + to + whitespace, or –
    • \$ - a $ char
    • \d+(?:,\d{3})*(?:\.\d+)? - see above
  • \s* - zero or more whitespaces
  • (thousand|million|billion) - one of the three strings.
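Putting the pieces together, the following is a rough sketch of the question's money_conversion with the two corrected patterns dropped in. It is condensed into one function for brevity, and the MULTIPLIERS dictionary is a hypothetical stand-in for the question's word_to_value helper. As in the original code, a range such as $3–$5 million is converted using its first number only.

```python
import re

# Corrected patterns from this answer.
number = r'\d+(?:,\d{3})*(?:\.\d+)?'
amount = r'thousand|million|billion'
word_re = rf'\${number}(?:(?:-|\sto\s|–)\${number})?\s*({amount})'
value_re = rf'\${number}'

# Hypothetical replacement for the question's word_to_value().
MULTIPLIERS = {'thousand': 1_000, 'million': 1_000_000, 'billion': 1_000_000_000}


def first_number(text):
    """Return the first number found in text as a float, ignoring commas."""
    return float(re.search(number, text).group().replace(',', ''))


def money_conversion(money):
    if money == 'N/A':
        return None
    if isinstance(money, list):
        money = money[0]

    word_match = re.search(word_re, money, flags=re.I)
    if word_match:
        # Group 1 is the only capturing group left: the magnitude word.
        return first_number(word_match.group()) * MULTIPLIERS[word_match.group(1).lower()]

    value_match = re.search(value_re, money)
    if value_match:
        return first_number(value_match.group())

    return None


print(money_conversion('$7 million'))   # 7000000.0
print(money_conversion('$17 million'))  # 17000000.0
print(money_conversion('$790,000'))     # 790000.0
```

Because every group inside number is now non-capturing, the magnitude word is always group 1, so it does not need to be searched for a second time.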

Upvotes: 2
