Droid-Bird

Reputation: 1529

How to separate numeric values from a string using regex in Python?

I have a string mixed with numbers and words. I want to be able to extract the numeric values from the string as tokens.

For example, the input

str1 = "Score 1 and 2 sometimes, often 1 and 1/2, or 2.5 or 3 and 1/3."

should ideally produce the output
Score -> word
1 -> number 
and -> word
2 -> number 
...
1 and 1/2 -> number (this group should stay together as number)
or -> word
2.5 -> number
...
3 and 1/3 -> number

I could partly solve the problem with the following two regexes.

Rule 1:
re.findall(r'\s*(\d*\.?\d+)\s*', str1)

Rule 2:
re.findall(r'(?:\s*\d* and \d+\/\d+\s*)', str1)

Each rule works on its own, but I could not combine them into one pattern. I tried this:

re.findall(r'(?:\s*(\d*\.?\d+)\s*)|(?:\s*\d* and \d+\/\d+\s*)', str1)

Can anyone please help and show how I could put the rules together and get the result?

Upvotes: 1

Views: 91

Answers (1)

Wiktor Stribiżew

Reputation: 627607

You can use

import re

text = "Score 1 and 2 sometimes, often 1 and 1/2, or 2.5 or 3 and 1/3."

matches = re.findall(r'((\d*\.?\d+(?:\/\d*\.?\d+)?)(?:\s+and\s+(\d*\.?\d+(?:\/\d*\.?\d+)?))?)', text)

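result = []
for x, y, z in matches:
    if '/' in x:
        # A fraction is involved: keep the whole match as one token
        result.append(x)
    else:
        # No fraction: keep each non-empty number as a separate token
        result.extend(v for v in (y, z) if v)

print(result)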
# => ['1', '2', '1 and 1/2', '2.5', '3 and 1/3']

See the Python demo. Here is the regex demo.
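Note that the result above holds only the numbers. If the word/number labels from the question are needed as well, one possible sketch (separate from the solution above) is a single alternation that tries the longest number form first and falls back to words. The names TOKEN and tokenize, the group names, and the choice of [A-Za-z]+ for words are assumptions made for this illustration; punctuation is simply skipped.

```python
import re

text = "Score 1 and 2 sometimes, often 1 and 1/2, or 2.5 or 3 and 1/3."

# Hypothetical token pattern (not from the original answer).
# Order matters: the mixed fraction must be tried before the plain number,
# otherwise "1 and 1/2" would be split into "1", "and", "1/2".
TOKEN = re.compile(r"""
    (?P<number>
        \d+\s+and\s+\d+/\d+        # mixed fraction such as 1 and 1/2
      | \d*\.?\d+(?:/\d+)?         # integer, decimal, or simple fraction
    )
  | (?P<word>[A-Za-z]+)            # a run of letters
""", re.VERBOSE)

def tokenize(s):
    """Return (text, kind) pairs, where kind is 'number' or 'word'."""
    return [(m.group(), m.lastgroup) for m in TOKEN.finditer(s)]

tokens = tokenize(text)
for value, kind in tokens:
    print(f"{value} -> {kind}")
```

With the sample text, "1 and 2" comes out as three tokens (number, word, number), while "1 and 1/2" stays together as a single number token.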

Details on the re.findall solution:

  • The regex contains three capturing groups: one around the pattern as a whole, and two wrapping the number or fraction patterns.
  • For each match, if the whole match contains a / char, put it into the result as a single item; otherwise, add the two inner captures as separate items, skipping any empty one.
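To make the post-processing step easier to follow, here is a small sketch that prints the raw tuples re.findall returns for the sample text, before any filtering. The variable names whole, first and second are only illustrative.

```python
import re

text = "Score 1 and 2 sometimes, often 1 and 1/2, or 2.5 or 3 and 1/3."
pattern = r'((\d*\.?\d+(?:\/\d*\.?\d+)?)(?:\s+and\s+(\d*\.?\d+(?:\/\d*\.?\d+)?))?)'

# With three capturing groups, findall returns one 3-tuple per match:
# (Group 1: whole match, Group 2: first number, Group 3: second number or '')
matches = re.findall(pattern, text)
for whole, first, second in matches:
    print(repr(whole), repr(first), repr(second))
# '1 and 2' '1' '2'
# '1 and 1/2' '1' '1/2'
# '2.5' '2.5' ''
# '3 and 1/3' '3' '1/3'
```

Group 3 is an empty string whenever the optional "and ..." part did not take part in the match, which is why empty items have to be filtered out.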

The regex matches:

  • ( - outer capturing group start (Group 1):
  • (\d*\.?\d+(?:\/\d*\.?\d+)?) - Group 2: a number/fraction pattern: zero or more digits, an optional ., one or more digits and then an optional occurrence of a / char and then zero or more digits, an optional ., one or more digits
  • (?:\s+and\s+(\d*\.?\d+(?:\/\d*\.?\d+)?))? - an optional occurrence of
    • \s+and\s+ - the word and with one or more whitespace chars on each side
    • (\d*\.?\d+(?:\/\d*\.?\d+)?) - Group 3: number/fraction pattern
  • ) - outer capturing group end.
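The breakdown above can also be written into the pattern itself with re.VERBOSE. The following is a sketch of that idea, assuming the number/fraction sub-pattern is factored out into a helper string; NUM, pattern and original are names introduced here for illustration only.

```python
import re

text = "Score 1 and 2 sometimes, often 1 and 1/2, or 2.5 or 3 and 1/3."

# NUM is a helper name used only in this sketch:
# a number, optionally followed by / and another number.
NUM = r"\d*\.?\d+(?:/\d*\.?\d+)?"

pattern = re.compile(rf"""
    (                      # Group 1: the whole match
        ({NUM})            # Group 2: first number or fraction
        (?:
            \s+and\s+      # the word and, with whitespace on each side
            ({NUM})        # Group 3: second number or fraction
        )?                 # the whole and-part is optional
    )
""", re.VERBOSE)

# The one-line pattern, for comparison
original = r'((\d*\.?\d+(?:\/\d*\.?\d+)?)(?:\s+and\s+(\d*\.?\d+(?:\/\d*\.?\d+)?))?)'

same = pattern.findall(text) == re.findall(original, text)
print(same)  # True
```

In verbose mode, whitespace and # comments inside the pattern are ignored, so the layout can mirror the group structure without changing what is matched.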

Upvotes: 1
