alvas

Reputation: 122112

Splitting on every character except for a preserved substring

Given the string

word = "These"

and a pair of characters that occur next to each other in it

pair = ("h", "e")

the aim is to split the word into single characters while keeping the pair together as one element, i.e. the output should be:

('T', 'he', 's', 'e')

I've tried:

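import re

word = 'These'
pair = ('h', 'e')
first, second = pair
pair_str = ''.join(pair)
# Double any backslashes so pattern.sub() inserts them literally
pair_str = pair_str.replace('\\','\\\\')
# Match "first second" only as two whole space-separated tokens
pattern = re.compile(r'(?<!\S)' + re.escape(first + ' ' + second) + r'(?!\S)')
new_word = ' '.join(word)  # 'T h e s e'
new_word = pattern.sub(pair_str, new_word)  # 'T he s e'
result = tuple(new_word.split())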

Note that the pair can sometimes contain slashes, backslashes, or other characters that are special in regular expressions, hence the replace and re.escape calls in the code above.
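One possible way to avoid doubling the backslashes by hand, sketched here as an assumption rather than as part of the original attempt, is to pass a function as the replacement. When the replacement given to pattern.sub is callable, its return value is inserted as-is, with no backslash processing. The helper name merge_pair is hypothetical.

```python
import re

def merge_pair(word, pair):
    # Hypothetical helper: same approach as above, but with a callable replacement
    first, second = pair
    pair_str = first + second
    pattern = re.compile(r'(?<!\S)' + re.escape(first + ' ' + second) + r'(?!\S)')
    spaced = ' '.join(word)
    # A callable's return value is used literally, so no manual escaping is needed
    merged = pattern.sub(lambda match: pair_str, spaced)
    return tuple(merged.split())

print(merge_pair('These', ('h', 'e')))   # ('T', 'he', 's', 'e')
print(merge_pair('a\\nb', ('\\', 'n')))  # the pair is a backslash followed by 'n'
```

Like the original attempt, this relies on split(), so it assumes the word itself contains no whitespace.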

Is there a simpler way to achieve the same string replacement?


EDITED

Specifics from comments:

And is there a distinction between when both characters in the pair are unique and when they aren't?

Nope, they should be treated the same way.

Upvotes: 2

Views: 80

Answers (2)

aghast

Reputation: 15310

You can do it without using regular expressions:

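import functools

word = 'These here when she'
pair = ('h', 'e')
digram = ''.join(pair)
# Split on the pair, then break each remaining piece into a list of characters
parts = map(list, word.split(digram))
# Rejoin the pieces, putting the pair back as a single element between them
lex = lambda pre,post: post if pre is None else pre+[digram]+post

print(functools.reduce(lex, parts, None))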
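The snippet above prints a list. To get the tuple asked for in the question, the result can be wrapped in tuple(). A minimal sketch of the same idea applied to the question's word follows. It drops the None initial value, since functools.reduce starts from the first element when no initializer is given. The function name split_keep_pair is hypothetical.

```python
import functools

def split_keep_pair(word, pair):
    # Hypothetical wrapper around the split-and-reduce idea
    digram = ''.join(pair)
    parts = map(list, word.split(digram))
    # With no initializer, reduce starts from the first piece
    joined = functools.reduce(lambda pre, post: pre + [digram] + post, parts)
    return tuple(joined)

print(split_keep_pair('These', ('h', 'e')))  # ('T', 'he', 's', 'e')
```

This is safe on an empty word because str.split with a non-empty separator always returns at least one piece.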

Upvotes: 1

Ry-

Reputation: 224942

Match instead of splitting:

import re

pattern = re.escape(''.join(pair)) + '|.'
result = tuple(re.findall(pattern, word))

The pattern is <pair>|., which matches the pair if possible and a single character* otherwise.

You can also do this without regular expressions:

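import itertools

# Pieces of the word between occurrences of the pair
non_pairs = word.split(''.join(pair))
# Fill every slot with a 1-tuple holding the pair, then overwrite the even slots
result = [(''.join(pair),)] * (2 * len(non_pairs) - 1)
result[::2] = non_pairs
# Chaining splits each string into characters but keeps each pair whole
result = tuple(itertools.chain(*result))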

* It doesn’t match newlines, though; if you have those, pass re.DOTALL as a third argument to re.findall.
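A short sketch putting the regex approach together, including the re.DOTALL case from the footnote and a pair made of regex metacharacters. The function name split_with_pair is hypothetical, and always passing re.DOTALL is a choice made here for illustration.

```python
import re

def split_with_pair(word, pair):
    # Try the escaped pair first, fall back to any single character
    pattern = re.escape(''.join(pair)) + '|.'
    # re.DOTALL lets '.' match newlines as well
    return tuple(re.findall(pattern, word, re.DOTALL))

print(split_with_pair('These', ('h', 'e')))   # ('T', 'he', 's', 'e')
print(split_with_pair('a.*b', ('.', '*')))    # ('a', '.*', 'b')
print(split_with_pair('he\nhe', ('h', 'e')))  # ('he', '\n', 'he')
```

Because the alternation is tried left to right at each position, a pair of identical characters is handled the same way as any other pair: 'eee' with ('e', 'e') gives ('ee', 'e').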

Upvotes: 3
