Daniel Yuen

Reputation: 391

Python split string without splitting escaped character

Is there a way to split a string without splitting on an escaped delimiter? For example, I have the following string and want to split on ':' but not on '\:':

http\://www.example.url:ftp\://www.example.url

The result should be the following:

['http\://www.example.url' , 'ftp\://www.example.url']

Upvotes: 33

Views: 26283

Answers (11)

SEIAROTg

Reputation: 666

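A correct and simpler version (Python 3.9+ is needed for str.removeprefix and the list[str] annotation):

import itertools
import re


def split(s: str) -> list[str]:
    # Split on escape sequences (captured) or bare delimiters (which yield None).
    parts = re.split(r'(\\.)|:', s)
    # Unescape: drop the leading backslash of each escape sequence; None stays None.
    parts = [p and p.removeprefix('\\') for p in parts]
    segments = itertools.groupby(parts, key=lambda p: p is None)
    return [''.join(segment) for is_delimiter, segment in segments if not is_delimiter]

>>> split(r'http\://www.example.url:ftp\://www.example.url')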
['http://www.example.url', 'ftp://www.example.url']
>>> split('')
['']
>>> split('::')
['', '', '']
>>> split('a:')
['a', '']
>>> split(r':\:\\:\\\:\\\\:\\\\\:\\\\\\:')
['', ':\\', '\\:\\\\', '\\\\:\\\\\\', '']

How it works:

  1. Split at every escape sequence (captured) or delimiter (not captured, so it shows up as None) into parts.
  2. Unescape all escape sequences.
  3. Split the list of parts at each real delimiter (None) into segments.
  4. Join each non-delimiter segment.
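
The steps above can be traced on a short input. This is only a sketch: the sample string and the variable names are illustrative assumptions, not part of the answer.

```python
import re

s = r'a\:b:c'  # hypothetical sample input

# Step 1: escape sequences are captured; a bare ':' contributes None.
parts = re.split(r'(\\.)|:', s)
print(parts)  # ['a', '\\:', 'b', None, 'c']

# Step 2: strip the leading backslash from each escape sequence.
unescaped = [p and p.removeprefix('\\') for p in parts]
print(unescaped)  # ['a', ':', 'b', None, 'c']
```

Steps 3 and 4 then group everything between the None markers and join each group, which would give ['a:b', 'c'] here.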

Upvotes: 0

mperon

Reputation: 79

I know this is an old question, but I recently needed a function like this and couldn't find one that met my requirements.

Rules:

  • The escape char only has an effect when it precedes the escape char or the delimiter. E.g. if the delimiter is / and the escape char is \, then \a\b\c/abc becomes ['\a\b\c', 'abc']
  • A doubled escape char is unescaped (\\ becomes \)

So, for the record, and in case someone is looking for something similar, here is my proposed function:

def str_escape_split(str_to_escape, delimiter=',', escape='\\'):
    """Split a string on a delimiter, honoring an escape character.

    Args:
        str_to_escape (str): The text to be split
        delimiter (str, optional): Delimiter used. Defaults to ','.
        escape (str, optional): The escape char. Defaults to a backslash.

    Yields:
        str: the next unescaped token
    """
    if len(delimiter) > 1 or len(escape) > 1:
        raise ValueError("Both delimiter and escape must be a single character")
    token = ''
    escaped = False
    for c in str_to_escape:
        if c == escape:
            if escaped:
                token += escape
                escaped = False
            else:
                escaped = True
            continue
        if c == delimiter:
            if not escaped:
                yield token
                token = ''
            else:
                token += c
                escaped = False
        else:
            if escaped:
                token += escape
                escaped = False
            token += c
    if escaped:
        # A trailing escape char escapes nothing, so keep it literally
        token += escape
    yield token

For sanity's sake, I wrote some tests:

# Each entry has the structure:
# ('string_to_split', [expected_result_list])
tests_slash_escape = [
    ('r/casa\\/teste/g', ['r', 'casa/teste', 'g']),
    ('r/\\/teste/g', ['r', '/teste', 'g']),
    ('r/(([0-9])\\s+-\\s+([0-9]))/\\g<2>\\g<3>/g',
     ['r', '(([0-9])\\s+-\\s+([0-9]))', '\\g<2>\\g<3>', 'g']),
    ('r/\\s+/ /g', ['r', '\\s+', ' ', 'g']),
    ('r/\\.$//g', ['r', '\\.$', '', 'g']),
    ('u///g', ['u', '', '', 'g']),
    ('s/(/[/g', ['s', '(', '[', 'g']),
    ('s/)/]/g', ['s', ')', ']', 'g']),
    ('r/(\\.)\\1+/\\1/g', ['r', '(\\.)\\1+', '\\1', 'g']),
    ('r/(?<=\\d) +(?=\\d)/./', ['r', '(?<=\\d) +(?=\\d)', '.', '']),
    ('r/\\\\/\\\\\\/teste/g', ['r', '\\', '\\/teste', 'g'])
]

tests_bar_escape = [
    ('r/||/|||/teste/g', ['r', '|', '|/teste', 'g'])
]

def test(test_array, escape):
    """Run str_escape_split on each test case and print pass/fail.

    Args:
        test_array (list): (input_string, expected_list) tuples
        escape (str): the escape char to use
    """
    for t in test_array:
        resg = str_escape_split(t[0], '/', escape)
        res = list(resg)
        if res == t[1]:
            print(f"Test {t[0]}: {res} - Pass!")
        else:
            print(f"Test {t[0]}: {t[1]} != {res} - Failed! ")


def test_all():
    test(tests_slash_escape, '\\')
    test(tests_bar_escape, '|')


if __name__ == "__main__":
    test_all()

Upvotes: 0

casper.dcl

Reputation: 14779

Building on @user629923's suggestion, but much simpler than the other answers:

import re
DBL_ESC = "!double escape!"

s = r"Hello:World\:Goodbye\\:Cruel\\\:World"

# list(...) is needed on Python 3, where map() returns a lazy iterator
list(map(lambda x: x.replace(DBL_ESC, r'\\'), re.split(r'(?<!\\):', s.replace(r'\\', DBL_ESC))))

Upvotes: 6

physicalattraction

Reputation: 6858

I have created this method, which is inspired by Henry Keiter's answer, but has the following advantages:

  • Variable escape character and delimiter
  • Do not remove the escape character if it is actually not escaping something

This is the code:

def _split_string(self, string: str, delimiter: str, escape: str) -> list[str]:
    result = []
    current_element = []
    iterator = iter(string)
    for character in iterator:
        if character == escape:
            try:
                next_character = next(iterator)
                if next_character != delimiter and next_character != escape:
                    # Drop the escape character only when it escapes the delimiter or the escape
                    # character itself. Otherwise it is not escaping anything, so copy it
                    # through unchanged.
                    current_element.append(escape)
                current_element.append(next_character)
            except StopIteration:
                current_element.append(escape)
        elif character == delimiter:
            # split! (add current to the list and reset it)
            result.append(''.join(current_element))
            current_element = []
        else:
            current_element.append(character)
    result.append(''.join(current_element))
    return result

This is test code indicating the behavior:

def test_split_string(self):
    # Verify normal behavior
    self.assertListEqual(['A', 'B'], list(self.sut._split_string('A+B', '+', '?')))

    # Verify that escape character escapes the delimiter
    self.assertListEqual(['A+B'], list(self.sut._split_string('A?+B', '+', '?')))

    # Verify that the escape character escapes the escape character
    self.assertListEqual(['A?', 'B'], list(self.sut._split_string('A??+B', '+', '?')))

    # Verify that the escape character is just copied if it doesn't escape the delimiter or escape character
    self.assertListEqual(['A?+B'], list(self.sut._split_string('A?+B', '\'', '?')))

Upvotes: 0

Mohammad Azim

Reputation: 2933

I think simple C-like parsing would be much simpler and more robust.

def escaped_split(s, ch):
    if len(ch) > 1:
        raise ValueError('Expected split character. Found string!')
    out = []
    part = ''
    escape = False  # True when the previous character was an unescaped backslash
    for c in s:
        if not escape and c == ch:
            out.append(part)
            part = ''
        else:
            part += c
            escape = not escape and c == '\\'
    if part:  # note: a trailing empty part is dropped
        out.append(part)
    return out

Upvotes: 1

user3339408

Reputation:

Here is an efficient solution that handles double-escapes correctly, i.e. any subsequent delimiter is not escaped. It ignores an incorrect single-escape as the last character of the string.

It is very efficient because it iterates over the input string exactly once, manipulating indices instead of copying strings around. Instead of constructing a list, it returns a generator.

def split_esc(string, delimiter):
    if len(delimiter) != 1:
        raise ValueError('Invalid delimiter: ' + delimiter)
    ln = len(string)
    i = 0
    j = 0
    while j < ln:
        if string[j] == '\\':
            if j + 1 >= ln:
                yield string[i:j]
                return
            j += 1
        elif string[j] == delimiter:
            yield string[i:j]
            i = j + 1
        j += 1
    yield string[i:j]

To allow for delimiters longer than a single character, simply advance i and j by the length of the delimiter in the "elif" case. This assumes that a single escape character escapes the entire delimiter, rather than a single character.

Tested with Python 3.5.1.
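
The multi-character variant described above could look roughly like the sketch below. It is not the author's code: the name split_esc_multi is made up here, and it assumes that one backslash escapes the whole delimiter and that the delimiter itself does not start with a backslash.

```python
def split_esc_multi(string, delimiter):
    """Hypothetical variant of split_esc for delimiters of any non-zero length."""
    if not delimiter:
        raise ValueError('Invalid delimiter: empty string')
    ln = len(string)
    dl = len(delimiter)
    i = 0  # start of the current segment
    j = 0  # scan position
    while j < ln:
        if string[j] == '\\':
            if j + 1 >= ln:
                # Trailing lone escape: ignore it, as in the single-char version.
                yield string[i:j]
                return
            if string.startswith(delimiter, j + 1):
                # Assumption: one escape character escapes the whole delimiter.
                j += 1 + dl
            else:
                # Skip the escape and the single character it escapes.
                j += 2
        elif string.startswith(delimiter, j):
            yield string[i:j]
            j += dl
            i = j
        else:
            j += 1
    yield string[i:j]


print(list(split_esc_multi(r'a\::b::c', '::')))  # ['a\\::b', 'c']
```

Like the original, this yields slices of the input, so escape characters are left in the output.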

Upvotes: 4

titan

Reputation: 11

There is no built-in function for that. Here's an efficient, general, and tested function that even supports delimiters of any length:

def escape_split(s, delim):
    i, res, buf = 0, [], ''
    while True:
        j, e = s.find(delim, i), 0
        if j < 0:  # end reached
            return res + [buf + s[i:]]  # add remainder
        while j - e and s[j - e - 1] == '\\':
            e += 1  # number of escapes
        d = e // 2  # number of double escapes
        if e != d * 2:  # odd number of escapes
            buf += s[i:j - d - 1] + s[j]  # add the escaped char
            i = j + 1  # and skip it
            continue  # add more to buf
        res.append(buf + s[i:j - d])
        i, buf = j + len(delim), ''  # start after delim

Upvotes: 1

Henry Keiter

Reputation: 17168

As Ignacio says, yes, but not trivially in one go. The issue is that you need lookbehind to determine whether you're at an escaped delimiter, and the basic str.split doesn't provide that functionality.
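
To see why plain str.split is not enough, here is a small sketch using the string from the question. It splits on every ':' because it never looks at the preceding character.

```python
s = r'http\://www.example.url:ftp\://www.example.url'

# str.split has no notion of escaping, so the escaped colons are split too.
pieces = s.split(':')
print(pieces)  # ['http\\', '//www.example.url', 'ftp\\', '//www.example.url']
```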

If this isn't inside a tight loop so performance isn't a significant issue, you can do it by first splitting on the escaped delimiters, then performing the split, and then merging. Ugly demo code follows:

# Bear in mind this is not rigorously tested!
def escaped_split(s, delim):
    # split by escaped, then by not-escaped
    escaped_delim = '\\' + delim
    sections = [p.split(delim) for p in s.split(escaped_delim)]
    # Start with the "real" splits of the first section
    ret = sections[0]
    for parts in sections[1:]:
        # An escaped delimiter sat between the previous last item and the
        # first item of this section, so join the two back together
        ret[-1] = escaped_delim.join([ret[-1], parts[0]])
        # Add the remaining items of this section
        ret.extend(parts[1:])
    return ret

s = r'http\://www.example.url:ftp\://www.example.url'
print (escaped_split(s, ':')) 
# >>> ['http\\://www.example.url', 'ftp\\://www.example.url']

Alternately, it might be easier to follow the logic if you just split the string by hand.

def escaped_split(s, delim):
    ret = []
    current = []
    itr = iter(s)
    for ch in itr:
        if ch == '\\':
            try:
                # skip the next character; it has been escaped!
                current.append('\\')
                current.append(next(itr))
            except StopIteration:
                pass
        elif ch == delim:
            # split! (add current to the list and reset it)
            ret.append(''.join(current))
            current = []
        else:
            current.append(ch)
    ret.append(''.join(current))
    return ret

Note that this second version behaves slightly differently when it encounters double-escapes followed by a delimiter: this function allows escaped escape characters, so that escaped_split(r'a\\:b', ':') returns ['a\\\\', 'b'], because the first \ escapes the second one, leaving the : to be interpreted as a real delimiter. So that's something to watch out for.

Upvotes: 10

Taha Jahangir

Reputation: 4902

An edited version of Henry's answer with Python 3 compatibility, tests, and a few fixes:

def split_unescape(s, delim, escape='\\', unescape=True):
    """
    >>> split_unescape('foo,bar', ',')
    ['foo', 'bar']
    >>> split_unescape('foo$,bar', ',', '$')
    ['foo,bar']
    >>> split_unescape('foo$$,bar', ',', '$', unescape=True)
    ['foo$', 'bar']
    >>> split_unescape('foo$$,bar', ',', '$', unescape=False)
    ['foo$$', 'bar']
    >>> split_unescape('foo$', ',', '$', unescape=True)
    ['foo$']
    """
    ret = []
    current = []
    itr = iter(s)
    for ch in itr:
        if ch == escape:
            try:
                # skip the next character; it has been escaped!
                if not unescape:
                    current.append(escape)
                current.append(next(itr))
            except StopIteration:
                if unescape:
                    current.append(escape)
        elif ch == delim:
            # split! (add current to the list and reset it)
            ret.append(''.join(current))
            current = []
        else:
            current.append(ch)
    ret.append(''.join(current))
    return ret

Upvotes: 6

user629923

Reputation: 541

There is a much easier way using a regex with a negative lookbehind assertion:

import re

re.split(r'(?<!\\):', s)  # s is the string to split
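
A quick sketch of how this behaves, using the string from the question plus one extra input (an assumption added here for illustration) where the backslash before the delimiter is itself escaped. The lookbehind only checks the single character before ':', so it cannot tell an escaped backslash from an escaping one.

```python
import re

pattern = r'(?<!\\):'  # a colon not immediately preceded by a backslash

first = re.split(pattern, r'http\://www.example.url:ftp\://www.example.url')
print(first)  # ['http\\://www.example.url', 'ftp\\://www.example.url']

# Hypothetical input: an escaped backslash followed by a real delimiter.
# The colon is still preceded by a backslash, so no split happens.
second = re.split(pattern, r'a\\:b')
print(second)  # ['a\\\\:b']
```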

Upvotes: 52

qaphla

Reputation: 4733

Note that : doesn't appear to be a character that needs escaping.

The simplest way that I can think of to accomplish this is to split on the character, and then add it back in when it is escaped.

Sample code (In much need of some neatening.):

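def splitNoEscapes(string, char):
    sections = string.split(char)
    # Re-append the delimiter to every section that ended with a backslash.
    # endswith() is used so that empty sections do not raise IndexError.
    sections = [i + (char if i.endswith("\\") else "") for i in sections]
    result = ["" for i in sections]
    j = 0
    for s in sections:
        result[j] += s
        # Stay on the same slot while the section ends with an escaped delimiter.
        j += (0 if s.endswith(char) else 1)
    return [i for i in result if i != ""]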

Upvotes: -4
