victor

Reputation: 1644

Split string with lookahead/lookbehind with empty string

I'm trying to split a numeric string such as 3.1415926535897932384626433832795028841971 right after each 0 or run of 0s, keeping the 0s at the end of each piece.

For example, the string 10203040506070809011 should be split into

['10', '20', '30', '40', '50', '60', '70', '80', '90', '11']

and the string 3.1415926535897932384626433832795028841971 should be split into

['3.14159265358979323846264338327950', '28841971']

I tried to split apart the string with a positive lookbehind and an empty string:

import re
p = '(?<=0+)'

re.search(p, '102030405')
><_sre.SRE_Match object; span=(2, 2), match=''>

'102030405'.split(p)
>['102030405']

but this does not split apart the string at all, even though the pattern is matched.

I also tried splitting the string on 0 and appending a 0 to every piece except the last, but that seems convoluted and inefficient.

l = '102030405'.split('0')
[e+'0' for e in l[:-1]] + [l[-1]]
>['10', '20', '30', '40', '5']

Is there any way to split a string based on a lookahead or lookbehind on an empty string? I'm asking about the general case, not just with numbers. For example, if I wanted to split apart 3:18am5:19pm10:28am into the separate times without losing the am or pm, and get an array ['3:18am', '5:19pm', '10:28am'], how would I go about doing this?

Upvotes: 1

Views: 1097

Answers (4)

Eric Chow

Reputation: 479

anubhava's answer is right. However, it requires installing the regex module, which is not needed: since Python 3.7, re.split itself can split on an empty (zero-width) match.

import re
pattern = r"(?<=0)(?=[1-9])"
s = "3.1415926535897932384626433832795028841971"
s2 = "10203040506070809011"
re.split(pattern, s)
# ['3.14159265358979323846264338327950', '28841971']
re.split(pattern, s2)
# ['10', '20', '30', '40', '50', '60', '70', '80', '90', '11']

See the re module documentation for more details about lookahead and lookbehind. In short:

(?<=...) constrains what comes immediately before the separator.

(?=...) constrains what comes immediately after the separator.

Thus, (?<=0)(?=[1-9]) matches an empty separator that has a zero right before it and a digit from 1 to 9 right after it.
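The same two assertions cover the times example from the question. This is a minimal sketch, assuming Python 3.7 or later (where re.split accepts patterns that match the empty string); the variable names are only illustrative.

```python
import re

times = "3:18am5:19pm10:28am"

# Split at every empty position that has "am" or "pm" right before it
# and a digit right after it. Assumes Python 3.7+ for zero-width splits.
parts = re.split(r"(?<=[ap]m)(?=\d)", times)
print(parts)  # ['3:18am', '5:19pm', '10:28am']
```

The end of the string is not a split point because the lookahead requires a digit to follow.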

Speed comparison between regex and re:

expression                               time
re.split(r"(?<=0)(?=[1-9])", s)          5.78 ns ± 0.103 ns
regex.split(r"(?<=0)(?=[1-9])", s)       6.04 ns ± 0.364 ns
re.split(r"(?<=0)(?=[1-9])", s2)         5.83 ns ± 0.061 ns
regex.split(r"(?<=0)(?=[1-9])", s2)      6.34 ns ± 1.16 ns
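The measurement setup is not shown above, so here is one way such a comparison could be reproduced with the standard timeit module. It is only a sketch: the helper name time_split and the repetition count are assumptions, the third-party regex module is treated as optional, and absolute numbers will differ by machine and Python version.

```python
import re
import timeit

try:
    import regex  # third-party module; optional in this sketch
except ImportError:
    regex = None

PATTERN = r"(?<=0)(?=[1-9])"
s = "3.1415926535897932384626433832795028841971"

def time_split(module, text, number=10_000):
    """Return the average seconds per call of module.split(PATTERN, text)."""
    total = timeit.timeit(lambda: module.split(PATTERN, text), number=number)
    return total / number

results = {"re": time_split(re, s)}
if regex is not None:
    results["regex"] = time_split(regex, s)

for name, seconds in results.items():
    print(f"{name}.split: {seconds * 1e6:.2f} us per call")
```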

Upvotes: 0

anubhava

Reputation: 785276

Python's re.split requires a non-zero-width match (this holds before Python 3.7, which added support for splitting on empty matches).

You can use findall with this regex to get your matches:

>>> import re
>>> print(re.findall(r'([\d.]+?(?:0+|$))', '10203040506070809011'))
['10', '20', '30', '40', '50', '60', '70', '80', '90', '11']

>>> print(re.findall(r'([\d.]+?(?:0+|$))', '3.1415926535897932384626433832795028841971'))
['3.14159265358979323846264338327950', '28841971']

([\d.]+?(?:0+|$)) matches the shortest run of digits or dots that is followed by one or more zeros or by the end of the line.
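To see how the lazy quantifier and the 0+ alternative interact on a run of several zeros, here is a small sketch. The wrapper name split_after_zeros is hypothetical, and the outer capturing group is dropped because findall returns the whole match when the pattern has no capturing groups.

```python
import re

def split_after_zeros(text):
    """Return pieces of text, each ending in a run of zeros or at the end."""
    # [\d.]+? grows one character at a time until 0+ or $ can match,
    # then 0+ greedily takes the whole run of zeros.
    return re.findall(r'[\d.]+?(?:0+|$)', text)

print(split_after_zeros('10020'))                 # ['100', '20']
print(split_after_zeros('10203040506070809011'))
```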


Update:

However, I see from your edited question and comments that you're looking for a generic way to use zero-width regex patterns in a split operation.

I suggest installing the very useful regex module for Python. Version 1 of this module provides most of the PCRE features and is far more capable than the default re module.

Installation is pretty straightforward. Download the tar gzip file from the above link, extract it, and run the following from inside the extracted directory:

sudo python setup.py install

(Ignore the few warnings printed during the install process.)

Once regex is installed just use this code:

>>> import regex

>>> regex.DEFAULT_VERSION = regex.VERSION1

>>> regex.split(r'(?<=[ap]m)(?=.)', '3:18am5:19pm10:28am')
['3:18am', '5:19pm', '10:28am']

>>> print(regex.split(r'(?<=0)(?=[1-9])', '10203040506070809011'))
['10', '20', '30', '40', '50', '60', '70', '80', '90', '11']

>>> print(regex.split(r'(?<=0)(?=[1-9])', '3.1415926535897932384626433832795028841971'))
['3.14159265358979323846264338327950', '28841971']

>>> print(regex.split(r'(?<=0)(?=[1-9])', '10020'))
['100', '20']
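If installing a third-party module is not an option, a zero-width split can also be emulated with the standard re module by collecting match positions with re.finditer and slicing between them. This is a minimal sketch, not part of either module: the helper name split_at_zero_width is hypothetical, and it assumes the pattern only ever matches the empty string.

```python
import re

def split_at_zero_width(pattern, text):
    """Cut text at every position where the zero-width pattern matches."""
    cuts = [m.start() for m in re.finditer(pattern, text)]
    bounds = [0] + cuts + [len(text)]
    # Slice between consecutive boundaries so no character is lost.
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]

nums = split_at_zero_width(r'(?<=0)(?=[1-9])', '10020')
times = split_at_zero_width(r'(?<=[ap]m)(?=.)', '3:18am5:19pm10:28am')
print(nums)   # ['100', '20']
print(times)  # ['3:18am', '5:19pm', '10:28am']
```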

Upvotes: 1

randomir

Reputation: 18697

This simple regex in re.findall should suffice:

l = re.findall(r'[.1-9]+(?:0+|$)', s)

Note:

  • findall returns all non-overlapping matches of pattern in string, as a list of strings.

  • for each match we want the longest run of nonzero digits (or dots) followed by at least one zero, or by the end of the string

  • the trailing zeros must not form a capturing group, because findall would then return only that group instead of the whole match (hence the (?:...))

Similarly for your second example:

>>> re.findall(r'[\d:]+(?:am|pm|$)', '3:18am5:19pm10:28am')
['3:18am', '5:19pm', '10:28am']

No need for lookahead/lookbehind magic, or non-greedy matching.
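A quick check of the first pattern against the question's input, plus one edge case worth knowing about. This is a sketch with illustrative variable names; the leading-zero input is an added test case, not one from the question.

```python
import re

pattern = r'[.1-9]+(?:0+|$)'

tens = re.findall(pattern, '10203040506070809011')
run_of_zeros = re.findall(pattern, '10020')
# Caveat: the character class excludes 0, so a match cannot start on a zero
# and a leading zero is silently dropped.
leading_zero = re.findall(pattern, '0.5')

print(tens)
print(run_of_zeros)   # ['100', '20']
print(leading_zero)   # ['.5']
```

If inputs can start with a zero, a splitting approach that keeps every character may be the safer choice.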

Upvotes: 0

Casimir et Hippolyte

Reputation: 89567

Use re.findall:

l = re.findall(r'(?<![^0])[1-9.]+0*', s)

The key is the double negation in (?<![^0]): the match must not be preceded by a character that is not a zero. That holds both right after a zero and at the start of the string, where there is no preceding character at all.
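For completeness, here is the pattern applied to the inputs from the question. This is a sketch with illustrative variable names, not a claim about how the pattern behaves on inputs outside these examples.

```python
import re

pattern = r'(?<![^0])[1-9.]+0*'

pi_parts = re.findall(pattern, '3.1415926535897932384626433832795028841971')
tens = re.findall(pattern, '10203040506070809011')
run_of_zeros = re.findall(pattern, '10020')

print(pi_parts)      # ['3.14159265358979323846264338327950', '28841971']
print(tens)
print(run_of_zeros)  # ['100', '20']
```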

Upvotes: 1
