Reputation: 1644
I'm trying to split any number string, such as 3.1415926535897932384626433832795028841971, right after each 0 or group of 0s. However, I would like to keep the 0s at the end of each piece.
For example, the string 10203040506070809011
should be split into
['10', '20', '30', '40', '50', '60', '70', '80', '90', '11']
and the string 3.1415926535897932384626433832795028841971
should be split into
['3.14159265358979323846264338327950', '28841971']
I tried to split apart the string with a positive lookbehind and an empty string:
import re
p = '(?<=0+)'
re.search(p, '102030405')
><_sre.SRE_Match object; span=(2, 2), match=''>
'102030405'.split(p)
>['102030405']
but this does not split apart the string at all, even though the pattern is matched.
I also tried just splitting the string on 0 and adding a 0 back to every piece except the last, but it seems convoluted and inefficient.
l = '102030405'.split('0')
[e+'0' for e in l[:-1]] + [l[-1]]
>['10', '20', '30', '40', '5']
Is there any way to split a string based on a lookahead or lookbehind on an empty string? I'm asking about the general case, not just with numbers. For example, if I wanted to split apart 3:18am5:19pm10:28am into the separate times without losing the am or pm, and get an array ['3:18am', '5:19pm', '10:28am'], how would I go about doing this?
Upvotes: 1
Views: 1097
Reputation: 479
Anubhava's answer is right. However, it requires installing the regex module, which is not needed: since Python 3.7, re.split in the standard library can split on zero-width matches.
import re
pattern = r"(?<=0)(?=[1-9])"
s = "3.1415926535897932384626433832795028841971"
s2 = "10203040506070809011"
re.split(pattern, s)
# ['3.14159265358979323846264338327950', '28841971']
re.split(pattern, s2)
# ['10', '20', '30', '40', '50', '60', '70', '80', '90', '11']
You should check the re module documentation for more details about lookahead and lookbehind. In short:
(?<=...) is a lookbehind: it asserts what must come immediately before the separator.
(?=...) is a lookahead: it asserts what must come immediately after the separator.
Thus, (?<=0)(?=[1-9]) means that the separator, which is the empty string, must have a zero before it and a digit from 1 to 9 after it.
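The same idea carries over to the times example from the question. This is a minimal sketch, assuming Python 3.7 or later (earlier versions of re.split do not split on empty matches); the variable names times and parts are only illustrative.

```python
import re

# Assumes Python 3.7+: re.split can split on zero-width matches.
times = "3:18am5:19pm10:28am"

# (?<=[ap]m) : the text just before the split point ends in "am" or "pm"
# (?=.)      : at least one character follows, so there is no split at the very end
parts = re.split(r"(?<=[ap]m)(?=.)", times)
print(parts)  # ['3:18am', '5:19pm', '10:28am']
```

Without the (?=.) lookahead, the pattern would also match at the end of the string and the result would end with an empty string.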
Speed comparison between regex and re:
expression | time |
---|---|
re.split(r"(?<=0)(?=[1-9])", s) | 5.78 ns ± 0.103 ns |
regex.split(r"(?<=0)(?=[1-9])", s) | 6.04 ns ± 0.364 ns |
re.split(r"(?<=0)(?=[1-9])", s2) | 5.83 ns ± 0.061 ns |
regex.split(r"(?<=0)(?=[1-9])", s2) | 6.34 ns ± 1.16 ns |
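For anyone who wants to repeat a comparison like this, here is a rough sketch using the standard timeit module. It only times re (the third-party regex module can be timed the same way if it is installed), and the repeat and number values are arbitrary choices, not the settings used for the figures above.

```python
import re
import timeit

s = "3.1415926535897932384626433832795028841971"

# Arbitrary benchmark settings (an assumption): 5 repeats of 10,000 calls each.
NUMBER = 10_000
runs = timeit.repeat(
    lambda: re.split(r"(?<=0)(?=[1-9])", s),
    number=NUMBER,
    repeat=5,
)

# Report the best repeat, converted to time per call.
best_per_call = min(runs) / NUMBER
print(f"best: {best_per_call * 1e9:.0f} ns per call")
```

Absolute numbers depend heavily on the machine and Python version, so only the relative difference between the two modules is meaningful.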
Upvotes: 0
Reputation: 785276
Python's re.split requires a non-zero-width match (this restriction was lifted in Python 3.7).
You can use findall with this regex to get your matches:
>>> print(re.findall(r'([\d.]+?(?:0+|$))', '10203040506070809011'))
['10', '20', '30', '40', '50', '60', '70', '80', '90', '11']
>>> print(re.findall(r'([\d.]+?(?:0+|$))', '3.1415926535897932384626433832795028841971'))
['3.14159265358979323846264338327950', '28841971']
([\d.]+?(?:0+|$)) matches the shortest possible run of digits or dots that ends with one or more 0s or at the end of the line.
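The lazy quantifier +? is what makes this work. As a small illustration (the variable names lazy and greedy are just for this sketch), compare it with the greedy version of the same pattern:

```python
import re

s2 = "10203040506070809011"

# Lazy: stop at the first place where 0+ or the end of the string can match.
lazy = re.findall(r"[\d.]+?(?:0+|$)", s2)

# Greedy: [\d.]+ first swallows the whole string, and $ then matches at the end,
# so there is only one match.
greedy = re.findall(r"[\d.]+(?:0+|$)", s2)

print(lazy)    # ['10', '20', '30', '40', '50', '60', '70', '80', '90', '11']
print(greedy)  # ['10203040506070809011']
```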
Update:
However, I note from your edited question and comments that you're looking for a generic way to use zero-width regex patterns in a split operation.
I suggest you install the very useful regex module for Python. Version 1 of this module provides most of the PCRE features and is far more capable than the default re module.
Installation is pretty straightforward. Just download the tar gzip file from the above link and then run:
sudo python setup.py install
from inside the directory that you get after extracting the tar file. (Ignore the few warnings during the install process.)
Once regex is installed, just use this code:
>>> import regex
>>> regex.DEFAULT_VERSION = regex.VERSION1
>>> regex.split(r'(?<=[ap]m)(?=.)', '3:18am5:19pm10:28am')
['3:18am', '5:19pm', '10:28am']
>>> print(regex.split(r'(?<=0)(?=[1-9])', '10203040506070809011'))
['10', '20', '30', '40', '50', '60', '70', '80', '90', '11']
>>> print(regex.split(r'(?<=0)(?=[1-9])', '3.1415926535897932384626433832795028841971'))
['3.14159265358979323846264338327950', '28841971']
>>> print(regex.split(r'(?<=0)(?=[1-9])', '10020'))
['100', '20']
Upvotes: 1
Reputation: 18697
This simple regex in re.findall should suffice:
l = re.findall(r'[.1-9]+(?:0+|$)', s)
Note:
- findall returns all non-overlapping matches of pattern in string, as a list of strings.
- For each match we want the longest string of digits (or a dot) ending with at least one zero, or the end of the string.
- The zeros at the end should not be captured as another match (hence the non-capturing group (?:...)).
Similarly for your second example:
>>> re.findall(r'[\d:]+(?:am|pm|$)', '3:18am5:19pm10:28am')
['3:18am', '5:19pm', '10:28am']
No need for lookahead/lookbehind magic, or non-greedy matching.
Upvotes: 0
Reputation: 89567
Use re.findall:
l = re.findall(r'(?<![^0])[1-9.]+0*', s)
The key is to use a double negation: (?<![^0]) means "not preceded by a character that is not a zero", which holds either right after a zero or at the start of the string.
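To make the double negation concrete, the sketch below compares it with a more explicit spelling of the same condition, "start of string or preceded by a zero". The alternative pattern and the variable names are illustrative only.

```python
import re

s2 = "10203040506070809011"

# Double negation: not preceded by a non-zero character.
double_negation = re.findall(r"(?<![^0])[1-9.]+0*", s2)

# Explicit alternative: either at the start of the string or right after a zero.
explicit = re.findall(r"(?:^|(?<=0))[1-9.]+0*", s2)

print(double_negation)  # ['10', '20', '30', '40', '50', '60', '70', '80', '90', '11']
print(double_negation == explicit)  # True
```

The double negation is shorter because a negative lookbehind succeeds automatically at the start of the string, where there is no preceding character at all.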
Upvotes: 1