Aleksander Lidtke

Reputation: 2926

Integer pattern - Python regex

I've seen many posts on this, but I still can't get it to work and I have no idea why.

What I have is a relatively simple string with some floating-point and integer numbers in it, e.g. '2 1.000000000000000 1 1 0'. I want to extract only the integers from it, in this example only 2, 1, 1, 0 (not the 1 that's followed by 0s).

I know I have to use lookbehind and lookahead to rule out numbers that are preceded or followed by a . character. I can successfully find the digits that are preceded by a dot, in this case the run of 0s:

import re
IntegerPattern = re.compile(r'-?(?<=\.)\d+(?!\.)')
a = '2   1.000000000000000       1   1 0'
IntegerPattern.findall(a)

returns ['000000000000000'], exactly as I want. But when I try to find numbers that are not preceded by a . it doesn't work:

import re
IntegerPattern = re.compile(r'-?(?<!\.)\d+(?!\.)')
a = '2   1.000000000000000       1   1 0'
IntegerPattern.findall(a)

returns ['2', '00000000000000', '1', '1', '0']. Any ideas why? I'm completely new to regular expressions in general and this just eludes me. It should work but it does not. Any help would be appreciated.

Upvotes: 0

Views: 2832

Answers (4)

nu11p01n73R

Reputation: 26667

Use the regex

(\s|^)\d+(\s|$)

The code can be

>>> import re
>>> n = '2 1.000000000000000 1 1 0'
>>> re.findall(r'(?:\s|^)\d+(?:\s|$)', n)
['2 ', ' 1 ', ' 0']

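(\s|^) matches a whitespace character or the start of the string

\d+ matches one or more digits

(\s|$) matches a whitespace character or the end of the string

NOTE: \b cannot be used to delimit \d+ because \b also matches between a digit and a ., so both halves of a float would be matched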

Example http://regex101.com/r/gP1nK0/1
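
To illustrate the note about \b, here is a small sketch (the shortened sample string and the variable names are mine, not from the question). It shows that a word boundary sits on both sides of the dot, so the integer part and the fractional part of the float are each reported as a separate match:

```python
import re

# Shortened sample input (assumption: same shape as the string in the question).
text = '2 1.000 1 1 0'

# \b matches between a digit and '.', so '1' and '000' are both found.
word_boundary_matches = re.findall(r'\b\d+\b', text)
print(word_boundary_matches)  # ['2', '1', '000', '1', '1', '0']
```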

EDIT

Why doesn't the regex (?<!\.)\d+(?!\.) work?

The problem is that the negative lookarounds only forbid a . immediately before and after the match; they do not stop the match from starting in the middle of a run of digits.

When you write (?<!\.), the regex engine looks for any position where the assertion can succeed.

In 1.000000, for example, the engine rejects the first 0 (it is preceded by .) and moves on to the second 0, where the previous character is a 0 rather than a . and the assertion holds. From there it matches the remaining 00000.

To get a clearer view, check this link:

http://regex101.com/r/gP1nK0/2

As you can see, for 1.000000000000000 the match starts at the second 0, which is what lets it succeed.
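
The same behaviour can be checked locally without regex101. This is only a sketch on a shortened input of my own choosing; it prints where the match starts, which makes the skipped first 0 visible:

```python
import re

text = '1.000'  # indices: 0='1', 1='.', 2='0', 3='0', 4='0'
pattern = re.compile(r'(?<!\.)\d+(?!\.)')

# Record the start index and the text of every match.
spans = [(m.start(), m.group()) for m in pattern.finditer(text)]
print(spans)  # [(3, '00')]
```

The leading 1 is rejected because it is followed by a ., the 0 at index 2 is rejected because it is preceded by a ., and the match begins at index 3.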

EDIT 1

A better regex would be

(?:(?<=^)|(?<=\s))\d+(?=\s|$)

>>> n
'1 2 3 4.5'
>>> re.findall(r'(?:(?<=^)|(?<=\s))\d+(?=\s|$)', n)
['1', '2', '3']
>>> n='1 2 3 4'
>>> re.findall(r'(?:(?<=^)|(?<=\s))\d+(?=\s|$)', n)
['1', '2', '3', '4']

Thank you, sln, for pointing that out.
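
As a rough comparison of the two patterns in this answer (the shortened input and variable names are assumptions of mine): the first pattern consumes the surrounding whitespace, so two integers separated by a single space cannot both match, while the lookaround version only inspects the whitespace and finds every integer.

```python
import re

text = '2 1.000 1 1 0'

# Consuming version: the space after one match is no longer available
# as the leading space of the next match.
consuming = re.findall(r'(?:\s|^)\d+(?:\s|$)', text)

# Lookaround version: whitespace is checked but not consumed.
lookaround = re.findall(r'(?:(?<=^)|(?<=\s))\d+(?=\s|$)', text)

print(consuming)   # ['2 ', ' 1 ', ' 0']
print(lookaround)  # ['2', '1', '1', '0']
```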

Upvotes: 3

Padraic Cunningham

Reputation: 180401

a = '-2   1.000000000000000       1   1 0'
print([x for x in a.split() if x[1:].isdigit() or x.isdigit()])
['-2', '1', '1', '0']

If you want the digits before the . also:

a = '2   1.000000000000000       1   1 0'
print([x if x.isdigit() else x.split(".")[0] for x in a.split()])
['2', '1', '1', '1', '0']

Upvotes: 1

user557597

Reputation:

The engine is compensating in order to match.
It sheds a \d on the left (the first digit after the .), then matches the rest.

This ensures no digits are shed on the left:

 # (?<![.\d])\d+(?!\.)

 (?<! [.\d] )
 \d+ 
 (?! \. )
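
A quick sketch of the pattern above using re.VERBOSE, on a shortened input of my own. One caveat that I believe holds: digits can still be shed on the right, so '12.5' yields '1'. The symmetric variant with [.\d] in the lookahead as well is my own suggestion, not part of the original answer.

```python
import re

# The pattern from this answer, written in expanded form.
shed_safe = re.compile(r'''
    (?<! [.\d] )   # not preceded by '.' or by another digit
    \d+
    (?! \. )       # not followed by '.'
''', re.VERBOSE)

# Hypothetical stricter variant: also forbid a digit right after the match.
symmetric = re.compile(r'(?<![.\d])\d+(?![.\d])')

print(shed_safe.findall('2 1.000 1 1 0'))  # ['2', '1', '1', '0']
print(shed_safe.findall('12.5'))           # ['1']  (the '2' is shed on the right)
print(symmetric.findall('12.5'))           # []
print(symmetric.findall('2 1.000 1 1 0'))  # ['2', '1', '1', '0']
```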

Just a note: in your first pattern, -?(?<=\.)\d+(?!\.),
the -? will never actually match a dash, because a dash is not the \. that the assertion
says must come immediately before the digits.
The rule is: never point an assertion in a direction that directly contains a literal
unless the literal is included in the assertion. In this case it is out of order anyway,
so the -? is entirely useless.
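
A small check of that claim, as a sketch with two test strings I made up: whichever side of the dot the dash is on, it never ends up in a match.

```python
import re

first_pattern = re.compile(r'-?(?<=\.)\d+(?!\.)')

# Dash before the dot: the digit matches, but without the dash.
dash_before_dot = first_pattern.findall('-.5')

# Dash between the dot and the digit: the lookbehind sees '-' instead of '.',
# so nothing matches at all.
dash_after_dot = first_pattern.findall('.-5')

print(dash_before_dot)  # ['5']
print(dash_after_dot)   # []
```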

Upvotes: 0

deets

Reputation: 6395

I wouldn't bother with regexes:

 s = '2   1.000000000000000       1   1 0'

 print([int(part) for part in s.split() if "." not in part])

It's often much simpler to work with basic string manipulation. As the old saying goes: "I had a problem I tried to solve with regexes. Then I had two problems."
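
If the input may also contain negative numbers or other non-numeric tokens, one possible variation on the same idea is to let int() decide. This is only a sketch; extract_ints is a name I made up, and it assumes the tokens are whitespace-separated as in the question.

```python
def extract_ints(line):
    """Return the whitespace-separated tokens of `line` that parse as integers."""
    result = []
    for token in line.split():
        try:
            result.append(int(token))
        except ValueError:
            # int() rejects tokens such as '1.000000000000000', so they are skipped.
            pass
    return result


print(extract_ints('-2   1.000000000000000       1   1 0'))  # [-2, 1, 1, 0]
```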

Upvotes: 2
