psun

Reputation: 635

Extract only percentage information from text in python using regex

I'm trying to extract only valid percentage values from a string, and reject malformed ones, using a regular expression in Python. The function should behave like this:

0-100%  = TRUE
0.12% = TRUE
23.1245467% = TRUE
9999% = FALSE
8937.2435% = FALSE
7.% = FALSE

I have checked a few solutions on Stack Overflow, but they only extract 0-100%. I have tried the following patterns:

('(\s100|[123456789][0-9]|[0-9])(\.\d+)+%')
'(\s100|\s\d{1,2})(\.\d+)+%'
'(\s100|\s\d[0-99])(\.\d+)+%'

All of these work for every case except 0-99% (gives FALSE) and 12411.23526% (gives TRUE). The leading space (\s) is there because I want to match numbers of at most two digits.

Upvotes: 2

Views: 4174

Answers (3)

psun

Reputation: 635

Figured it out. The problem lay in the '+' in the expression '(\.\d+)+', which should have been '(\.\d+)*'. The first expression requires a decimal part on every two-digit percentage value, whereas the second makes it optional. My final version is given below.

'\s(100|(\d{1,2}(\.\d+)*))%' 

You can replace \s with ^ for percentage values at the beginning of a string. Also, the versions in my question accepted decimal values for 100 (such as 100.5%), which is not a valid percentage value.
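
For anyone who wants to try the pattern, here is a minimal sketch that runs it over a sample string. The helper name find_percentages and the sample text are my own and only there for illustration.

```python
import re

# Pattern from the answer above. The leading \s means a match needs
# a whitespace character directly before the number.
PERCENT_RX = re.compile(r'\s(100|(\d{1,2}(\.\d+)*))%')

def find_percentages(text):
    """Return the numeric part (group 1) of each percentage the pattern accepts."""
    return [m.group(1) for m in PERCENT_RX.finditer(text)]

sample = "rates: 45% 100% 0.12% 23.1245467% 9999% 8937.2435% 7.%"
print(find_percentages(sample))
# ['45', '100', '0.12', '23.1245467']
```

One thing to be aware of: because the decimal group is repeated with '*', an input like " 1.2.3%" is also accepted. Using '(\.\d+)?' instead would allow at most one decimal part, if that matters for your data.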

Upvotes: 1

user2705585

Reputation:

Considering all the possibilities, the following regex works.

If you ignore the ?: markers (they only make a group non-capturing), the regex is not that intimidating.

Regex: ^(?:(?:\d{1,2}(?:\.\d+)?\-)?(?:(?:\d{1,2}(?:\.\d+)?)|100))%$

Explanation:

  • (?:\d{1,2}(?:\.\d+)?\-)? matches the lower limit, if there is one (as in 0-100%), with an optional decimal part.

  • (?:(?:\d{1,2}(?:\.\d+)?)|100) matches the upper limit, or the only number when there is no range: either one or two digits with an optional decimal part, or exactly 100.

Regex101 Demo
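
As a rough check in Python, the anchored regex can be applied to one token at a time. This is only a sketch; the function name is_valid_percentage and the token list are assumptions made for the example.

```python
import re

# Anchored regex from above: the whole token must be a percentage
# (optionally a range such as 0-100%).
ANCHORED_RX = re.compile(
    r'^(?:(?:\d{1,2}(?:\.\d+)?\-)?(?:(?:\d{1,2}(?:\.\d+)?)|100))%$'
)

def is_valid_percentage(token):
    """Return True if the token as a whole matches the anchored pattern."""
    return ANCHORED_RX.match(token) is not None

tokens = ["0-100%", "0.12%", "23.1245467%", "9999%", "8937.2435%", "7.%"]
for token in tokens:
    print(token, is_valid_percentage(token))
```

Note that the pattern only checks the shape of each number. It does not check that the lower limit of a range is smaller than the upper limit.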


Another version of the same regex, for matching such occurrences within a longer string, removes the anchors ^ and $ and instead checks for a non-digit (or the start of the string) before the match.

Regex: (?<=\D|^)(?:(?:\d{1,2}(?:\.\d+)?\-)?(?:(?:\d{1,2}(?:\.\d+)?)|100))%

Regex101 Demo
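
A caveat for Python users: the built-in re module only accepts fixed-width look-behind, so (?<=\D|^) is rejected there. The sketch below swaps in the negative look-behind (?<!\d), which should express the same "no digit directly before" condition. That substitution is mine, not part of the regex above, and the sample text is made up.

```python
import re

# Same body as the regex above, but with (?<!\d) in place of (?<=\D|^),
# because Python's re module requires fixed-width look-behind.
INLINE_RX = re.compile(
    r'(?<!\d)(?:(?:\d{1,2}(?:\.\d+)?\-)?(?:(?:\d{1,2}(?:\.\d+)?)|100))%'
)

text = "growth of 0-100% or 0.12%, not 9999% or 8937.2435% or 7.%"
# With no capturing groups, findall returns the full matches.
print(INLINE_RX.findall(text))
# ['0-100%', '0.12%']
```

Either form of the look-behind still lets through the tail of a longer decimal: in "123.45%" the character before "45" is a dot, so "45%" is reported. Widening the look-behind to (?<![\d.]) is one possible way to tighten this.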

Upvotes: 0

Jan

Reputation: 43169

I would not rely on regex alone, since it is not meant to filter numeric ranges in the first place.
It is better to look for candidates in your string and analyze them programmatically afterwards, like so:

import re

string = """
some gibberish in here 0-100%  = TRUE
some gibberish in here  0.12% = TRUE
some gibberish in here 23.1245467% = TRUE
some gibberish in here  9999% = FALSE
some gibberish in here 8937.2435% = FALSE
some gibberish in here 7.% = FALSE
"""

numbers = []
# look for -, a digit, a dot ending with a digit and a percentage sign
rx = r'[-\d.]+\d%'

# loop over the results
for match in re.finditer(rx, string):
    interval = match.group(0).split('-')
    for number in interval:
        if 0 <= float(number.strip('%')) <= 100:
            numbers.append(number)

print(numbers)
# ['0', '100%', '0.12%', '23.1245467%']
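
The loop above calls float() on every fragment, which raises ValueError for candidates such as "-5%" (empty string before the dash) or "1.2.3%". A slightly more defensive sketch of the same idea, wrapped in a function, could look like this. The name extract_percentages is hypothetical.

```python
import re

# Same candidate pattern as above: dashes, digits and dots,
# ending in a digit followed by a percent sign.
CANDIDATE_RX = re.compile(r'[-\d.]+\d%')

def extract_percentages(text):
    """Collect the fragments of each candidate that parse to a value in 0..100."""
    valid = []
    for match in CANDIDATE_RX.finditer(text):
        # A dash is treated as a range separator, as in the code above.
        for part in match.group(0).split('-'):
            try:
                value = float(part.strip('%'))
            except ValueError:
                # Skip empty or malformed fragments such as '' or '1.2.3'.
                continue
            if 0 <= value <= 100:
                valid.append(part)
    return valid

print(extract_percentages("0-100% 0.12% 9999% 1.2.3%"))
# ['0', '100%', '0.12%']
```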

Upvotes: 0
