Reputation: 2926
I've seen many posts on this, but I still can't get it to work and I have no idea why.
What I have is a relatively simple string with some floating-point and integer numbers in it, e.g. '2 1.000000000000000 1 1 0'. I want to extract only the integers from it, in this example only 2, 1, 1, 0 (not the 1 that's followed by 0s).
I know I have to use lookbehind and lookahead to rule out numbers that are preceded or followed by a ".". I can successfully find the numbers that are preceded by a decimal point, in this case the zeros after the point:
import re
IntegerPattern = re.compile(r'-?(?<=\.)\d+(?!\.)')
a = '2 1.000000000000000 1 1 0'
IntegerPattern.findall(a)
will return ['000000000000000'], exactly as I want. But when I try to find numbers that are not preceded by a ".", this doesn't work:
import re
IntegerPattern = re.compile(r'-?(?<!\.)\d+(?!\.)')
a = '2 1.000000000000000 1 1 0'
IntegerPattern.findall(a)
returns ['2', '00000000000000', '1', '1', '0']. Any ideas why? I'm completely new to regular expressions in general and this just eludes me. It should work but it does not. Any help would be appreciated.
Upvotes: 0
Views: 2832
Reputation: 26667
Use the regex
(\s|^)\d+(\s|$)
For example:
>>> import re
>>> n = '2 1.000000000000000 1 1 0'
>>> re.findall(r'(?:\s|^)\d+(?:\s|$)', n)
['2 ', ' 1 ', ' 0']
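(\s|^) matches a whitespace character or the start of the string
\d+ matches one or more digits
(\s|$) matches a whitespace character or the end of the string
NOTE: \b cannot be used to delimit \d+, because there is also a word boundary between a digit and a ".", so \b\d+\b would match the digits on either side of a decimal point.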
Example http://regex101.com/r/gP1nK0/1
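A quick way to see the \b problem, as a minimal sketch using only the standard re module (the sample string is made up for illustration):

```python
import re

sample = '2 1.000 1'  # made-up input for illustration

# \b matches between a digit and ".", so both halves of the float come back.
word_boundary_matches = re.findall(r'\b\d+\b', sample)
print(word_boundary_matches)  # ['2', '1', '000', '1']
```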
EDIT
Why doesn't the regex (?<!\.)\d+(?!\.) work?
With negative lookaround assertions, the regex engine is free to start the match at any position where the assertion holds. Take 1.000000: at the first 0 the lookbehind (?<!\.) fails, because the previous character is ".". The engine then moves on to the second 0, where the previous character is 0 rather than ".", so the assertion succeeds and \d+ matches the remaining 00000.
For a clearer view, check this link:
http://regex101.com/r/gP1nK0/2
As you can see, for 1.000000000000000 the match starts at the second 0, which is why it succeeds.
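The same behaviour can be observed without regex101. The sketch below uses re.finditer to print where each match starts; it assumes the string from the question and drops the leading -? for brevity:

```python
import re

a = '2 1.000000000000000 1 1 0'
pattern = re.compile(r'(?<!\.)\d+(?!\.)')

# Collect (start, end, text) for every match so the positions are visible.
matches = [(m.start(), m.end(), m.group()) for m in pattern.finditer(a)]
for start, end, text in matches:
    print(start, end, repr(text))

# The run of zeros is matched starting one character after the first zero.
first_zero = a.index('.') + 1
zero_match = matches[1]
print(zero_match[0] == first_zero + 1)  # True
```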
EDIT 1
A more robust regex would be
(?:(?<=^)|(?<=\s))\d+(?=\s|$)
>>> n = '1 2 3 4.5'
>>> re.findall(r'(?:(?<=^)|(?<=\s))\d+(?=\s|$)', n)
['1', '2', '3']
>>> n='1 2 3 4'
>>> re.findall(r'(?:(?<=^)|(?<=\s))\d+(?=\s|$)', n)
['1', '2', '3', '4']
Thank you sln for pointing that out
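To illustrate why the lookaround version is preferable, here is a small comparison on the string from the question. The first pattern consumes the surrounding whitespace, so two integers separated by a single space cannot both match; the second only asserts it. The third pattern, (?<!\S)\d+(?!\S), is not from this answer; it is a shorter variant that I believe is equivalent here.

```python
import re

n = '2 1.000000000000000 1 1 0'

# Consumes the delimiters: the space after one match is no longer
# available as the leading delimiter of the next integer.
consuming = re.findall(r'(?:\s|^)\d+(?:\s|$)', n)

# Only asserts the delimiters, so nothing is consumed.
lookaround = re.findall(r'(?:(?<=^)|(?<=\s))\d+(?=\s|$)', n)

# Assumed shorter variant: "not preceded/followed by a non-space character".
compact = re.findall(r'(?<!\S)\d+(?!\S)', n)

print(consuming)   # ['2 ', ' 1 ', ' 0']
print(lookaround)  # ['2', '1', '1', '0']
print(compact)     # ['2', '1', '1', '0']
```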
Upvotes: 3
Reputation: 180401
a = '-2 1.000000000000000 1 1 0'
print([x for x in a.split() if (x[1:] if x.startswith('-') else x).isdigit()])
['-2', '1', '1', '0']
If you want the digits before the "." as well:
a = '2 1.000000000000000 1 1 0'
print([x if x.isdigit() else x.split(".")[0] for x in a.split() ])
['2', '1', '1', '1', '0']
Upvotes: 1
Reputation:
The engine is compensating in order to match: it sheds a \d on the left, then matches.
The following ensures no digits are shed on the left:
# (?<![.\d])\d+(?!\.)
(?<! [.\d] )
\d+
(?! \. )
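As a rough check of this pattern, the sketch below runs it on the string from the question. It also tries a made-up input with a multi-digit integer part, where the engine can shed a digit on the right instead; the stricter lookahead (?![.\d]) is my own variation, not part of this answer.

```python
import re

a = '2 1.000000000000000 1 1 0'
left_guarded = re.compile(r'(?<![.\d])\d+(?!\.)')
print(left_guarded.findall(a))  # ['2', '1', '1', '0']

# Made-up input: "12" fails the lookahead, so \d+ backtracks to "1".
b = '2 12.50 7'
print(left_guarded.findall(b))  # ['2', '1', '7']

# Assumed variation: also forbid a digit on the right, so no backtracking helps.
both_guarded = re.compile(r'(?<![.\d])\d+(?![.\d])')
print(both_guarded.findall(b))  # ['2', '7']
```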
Just a note: in your first pattern, -?(?<=\.)\d+(?!\.), the -? will never actually match a dash, because a dash is not the \. that the assertion states must be there.
The rule is: never point an assertion in a direction that directly contains a literal unless that literal is included in the assertion. In this case it is out of order anyway, so the -? is entirely useless.
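A small sketch of this point, using a made-up string with negative numbers: the pattern returns the same matches with or without the -?. The last pattern is only a suggestion of mine for picking up a sign, placing the dash after the lookbehind so the two no longer conflict.

```python
import re

text = '-7 3.25 -0.50'  # made-up input for illustration

with_dash = re.findall(r'-?(?<=\.)\d+(?!\.)', text)
without_dash = re.findall(r'(?<=\.)\d+(?!\.)', text)
print(with_dash == without_dash)  # True: the -? never contributes a dash

# Assumed alternative: the lookbehind comes first, then the optional sign.
signed_integers = re.findall(r'(?<![.\d])-?\d+(?![.\d])',
                             '-2 1.000000000000000 1 1 0')
print(signed_integers)  # ['-2', '1', '1', '0']
```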
Upvotes: 0
Reputation: 6395
I wouldn't bother with regexes:
s = '2 1.000000000000000 1 1 0'
print([int(part) for part in s.split() if "." not in part])
It's often much simpler to work with basic string manipulation. Or, as the old saying goes: "I had a problem I tried to solve with regexes. Then I had two problems."
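In the same spirit, one possible variant lets int() decide what counts as an integer, which also covers negative numbers and skips tokens that are not numbers at all. The helper name extract_integers is made up for this sketch. Note that int() is somewhat permissive (it accepts tokens such as '+3', for example), so this may admit more than a strict digit check would.

```python
def extract_integers(text):
    """Return the whitespace-separated tokens of text that int() accepts."""
    result = []
    for token in text.split():
        try:
            result.append(int(token))
        except ValueError:
            # Floats such as '1.000' and non-numeric tokens end up here.
            pass
    return result


print(extract_integers('-2 1.000000000000000 1 1 0'))  # [-2, 1, 1, 0]
```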
Upvotes: 2