italianfoot

Reputation: 199

Finding numbers with a certain format

I have to find all the numbers in a file that follow a specific format. The format is as follows:

Each number may be positive or negative (the sign may or may not be present). There are one or more digits before the decimal point and, if a decimal point is present, one or more digits after it; the decimal point itself is optional. There can be spaces before and after each number. Two numbers are separated by a comma (,), a semicolon (;) or a colon (:). For example: (35.3 , 52.23; -623, 623.62 : -52,65)

So in the above example there are six numbers that I want listed. The list of numbers to be searched is between parentheses. So far my code looks like this:

def number_processing( file_location ):
    """"""

    import re

    file_variable = open( file_location )
    lines = file_variable.readlines()

    numbers = re.compile(r'[(] *[+]?[-]?[0-9][0-9]*[.]+[,]+[;]+[0-9][0-9]* *[)]')
    numbers_list = []

    for line in lines:
        for word in line.split(" "):
            match = numbers.match(word)
            if match:
                numbers_list.append(match.group())
    print numbers_list

Any help is greatly appreciated!

Upvotes: 2

Views: 467

Answers (4)

wim

Reputation: 362517

I don't think you need str.split; how about just using re.findall?

>>> s = '35.3 , 52.23; -623, 623.62 : -52,65'
>>> re.findall(r'[-+]?\d+(?:\.\d*)?', s)
['35.3', '52.23', '-623', '623.62', '-52', '65']

Edit: to search only inside pairs of parentheses, you can first find those pairs with another regex and then reuse the one above:

>>> s = '(23432.434 , 32423, -4343; 343) 5555 (3244, 45445; -4545 )'
>>> for s_ in re.findall(r'\(.*?\)', s):
...   re.findall(r'[-+]?\d+(?:\.\d*)?', s_)
... 
['23432.434', '32423', '-4343', '343']
['3244', '45445', '-4545']

To flatten the sub-lists above into a single list with a list comprehension:

>>> s = '(23432.434 , 32423, -4343; 343) 5555 (3244, 45445; -4545 )'
>>> pat1 = re.compile(r'\(.*?\)')
>>> pat2 = re.compile(r'[-+]?\d+(?:\.\d*)?')
>>> [x for s_ in re.findall(pat1, s) for x in re.findall(pat2, s_)]
['23432.434', '32423', '-4343', '343', '3244', '45445', '-4545']
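
Putting the two steps together for a whole file might look like the sketch below. The names `numbers_in_parens` and `read_numbers` are just illustrative, and the number pattern is tightened to `\.\d+` on the assumption that a decimal point must be followed by at least one digit, as the question describes.

```python
import re

# Non-greedy match of one parenthesized group; assumes groups are not nested.
PAREN_GROUP = re.compile(r'\(.*?\)')
# Optional sign, digits, then optionally a point followed by more digits.
NUMBER = re.compile(r'[-+]?\d+(?:\.\d+)?')


def numbers_in_parens(text):
    """Return every number found inside (...) groups, as strings."""
    return [num
            for group in PAREN_GROUP.findall(text)
            for num in NUMBER.findall(group)]


def read_numbers(file_location):
    """Hypothetical wrapper: apply numbers_in_parens to a file's contents."""
    with open(file_location) as handle:
        return numbers_in_parens(handle.read())
```

Since `.` does not match a newline by default, a group that spans several lines would be skipped; compiling `PAREN_GROUP` with `re.DOTALL` is one way to allow that.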

Upvotes: 6

joe

Reputation: 817

If you just want to find all the numbers, why not do something like:

re.findall(r'[-+\d.]+', text)

and not worry about the extraneous formatting?

(Note: this will also match e.g. +2323., which is weird input, but Python's float can deal with it. If you do

list(map(float, re.findall(r'[-+\d.]+', text)))

you'll still get a nice pretty list of floats.)
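
One caveat with a loose pattern like `[-+\d.]+` (sign characters first in the class, with a `+` quantifier so whole tokens are matched): it can also pick up tokens that `float` rejects, such as a lone `-` or `1.2.3`. A minimal sketch that skips those, where `loose_floats` is just an illustrative name:

```python
import re

# Any run of digits, signs and dots, with no check on their order.
LOOSE = re.compile(r'[-+\d.]+')


def loose_floats(text):
    """Convert every loose token to float, skipping ones float() rejects."""
    values = []
    for token in LOOSE.findall(text):
        try:
            values.append(float(token))
        except ValueError:
            # e.g. a lone '-' or '.', or something like '1.2.3'
            pass
    return values
```

Whether silently dropping bad tokens is acceptable depends on the input; raising an error instead may be the safer choice for data you need to trust.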

Upvotes: 0

Scott Weaver

Reputation: 7351

If you just want to consume these tokens with a regex, run the expression globally and greedily eat digits, minus signs, and decimal points:

/[\d.-]+/

Upvotes: 0

skytreader

Reputation: 11697

Since you are already splitting on spaces, a regex something like...

[(]?[+-]?\d+(?:\.\d+)?[,;:]?[)]?

(Parentheses don't need to be escaped inside a character class, so [(] and [)] are fine as written.)

Also:

[+]?[-]? - What's with this? You are telling your regex that both a + and a - may occur together, though it is possible that one or both of them are absent. [+-]? allows at most one sign.

And your regex as a whole tries to recognize two separate numbers at once (if I understand what you mean by commas, semicolons and colons), the second one an integer at that. There are a lot of test cases where you won't get what you want.
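
To see the difference in sign handling, here is a small sketch comparing the two on whole tokens. The anchors `^` and `$` and the names `BOTH_SIGNS` and `ONE_SIGN` are additions for the sake of the demonstration:

```python
import re

# Sign handling as in the question: an optional + followed by an optional -.
BOTH_SIGNS = re.compile(r'^[+]?[-]?\d+$')
# A single optional sign character, either + or -.
ONE_SIGN = re.compile(r'^[-+]?\d+$')

for token in ('5', '+5', '-5', '+-5'):
    # bool(...) turns the match object (or None) into True/False.
    print(token, bool(BOTH_SIGNS.match(token)), bool(ONE_SIGN.match(token)))
```

Both patterns accept 5, +5 and -5, but only the first accepts +-5, which is probably not a number you want to let through.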

Upvotes: 0
