alvas
Reputation: 122072

Simplifying regex to replace underscores adhering to some substring logic

The input string is:

s = 'The\ue000 Project\ue000Gutenber g\ue000 E Book \ue000of\ue000 The\ue000 Ad vent ure s\ue000of\ue000 Sherlock\ue000 Holmes\n '

And the output string is:

o = 'The Project Gutenber_ _g E_ _Book of The Ad_ _vent_ _ure_ _s of Sherlock Holmes\n'

Note that in the input string, the \ue000 characters are the hard delimiters between words.

The aim is to turn every hard delimiter into a single ordinary space and to mark every remaining space (a soft break inside a word) with '_ _'.

The full set of replacements, in the order they are applied:

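text = text.replace(u'\n ', '\n')             # drop the space that follows each newline
text = text.replace(u' ', '_ ')               # tag every space as a soft break
text = text.replace(u'_ \uE000', u' \uE000')  # untag a space directly before a hard delimiter
text = text.replace(u"\uE000", u' ')          # turn each hard delimiter into a space
text = text.replace(u'  ', u' ')              # collapse double spaces
text = text.replace(u' _ ', u' ')             # untag a space directly after a hard delimiter
text = text.replace(u'_ ', u'_ _')            # expand each remaining soft break to '_ _'
text = text.replace(u'  ', u' ')              # collapse any double spaces left over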

Note: the first replacement, text.replace(u'\n ', '\n'), is necessary because the string could be the contents of a full text file, and str.strip() alone would not remove the unnecessary space that follows each \n at the start of a new line.
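A small sketch of that difference, using a made-up two-line string (the sample text is an assumption for illustration, not taken from a real file):

```python
# Hypothetical two-line input: each line break is followed by a stray space.
text = 'Sherlock\ue000 Holmes\n The\ue000 Ad vent ure s\n '

# str.strip() only trims the two ends of the whole string,
# so the space after the first newline survives.
stripped = text.strip()

# Replacing '\n ' removes the space after every newline.
replaced = text.replace('\n ', '\n')

print(repr(stripped))
print(repr(replaced))
```

Only the replace() version gets rid of the space at the start of the second line.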

Is there a less convoluted way to achieve the same output string that keeps the logic of why the replacements are done in the way described above?

Upvotes: 0

Views: 498

Answers (1)

schesis
Reputation: 59148

I don't quite follow your penultimate paragraph regarding newlines, but that aside, a single re.sub() is sufficient to get you most of the way:

>>> import re
>>> 
>>> re.sub(r'[ \ue000]+', lambda m: ' ' if '\ue000' in m.group() else '_ _', s)
'The Project Gutenber_ _g E_ _Book of The Ad_ _vent_ _ure_ _s of Sherlock Holmes\n_ _'

This finds all sequences of \ue000 and spaces, then replaces those sequences using a lambda that returns either a space or '_ _' depending on whether the match contains \ue000.
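To make the lambda's decision easier to see, the same logic can be spelled out with a named function and re.finditer(). This is only an illustrative sketch: the names pattern and replace_run are introduced here and are not part of the one-liner above.

```python
import re

s = ('The\ue000 Project\ue000Gutenber g\ue000 E Book \ue000of\ue000 '
     'The\ue000 Ad vent ure s\ue000of\ue000 Sherlock\ue000 Holmes\n ')

# One run = any mix of plain spaces and \ue000 hard delimiters.
pattern = re.compile('[ \ue000]+')

def replace_run(match):
    # A run containing a hard delimiter is a real word boundary;
    # a run of plain spaces only is a soft break inside a word.
    return ' ' if '\ue000' in match.group() else '_ _'

# Show the first few runs and what each one turns into.
for m in list(pattern.finditer(s))[:4]:
    print(repr(m.group()), '->', repr(replace_run(m)))

result = pattern.sub(replace_run, s)
```

Because the character class matches the whole run at once, a space on either side of a \ue000 is swallowed into the same match and never gets a chance to become '_ _'.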

After that, as far as I can tell (as I said, your penultimate paragraph is somewhat confusing), you just need to strip underscores and spaces:

>>> re.sub(r'[ \ue000]+', lambda m: ' ' if '\ue000' in m.group() else '_ _', s).strip('_ ')
'The Project Gutenber_ _g E_ _Book of The Ad_ _vent_ _ure_ _s of Sherlock Holmes\n'
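If the text spans several lines, .strip('_ ') only cleans the ends of the whole string, so a space after an interior newline would still turn into '_ _'. One possible way around that, assuming the goal is simply to drop spaces that follow a newline (as the question's first replace() does), is to remove them before the substitution. The helper name mark_soft_breaks and the two-line sample are made up for illustration.

```python
import re

def mark_soft_breaks(text):
    # Drop spaces right after each newline. Unlike text.replace('\n ', '\n'),
    # this also removes runs of several spaces, which is an assumption here.
    text = re.sub(r'\n +', '\n', text)
    # Runs with \ue000 become one space; runs of plain spaces become '_ _'.
    return re.sub('[ \ue000]+',
                  lambda m: ' ' if '\ue000' in m.group() else '_ _',
                  text)

sample = 'Sherlock\ue000 Holmes\n The\ue000 Ad vent ure s\n '
print(repr(mark_soft_breaks(sample)))
```

On the single-line s from the question this gives the same string as the .strip('_ ') version.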

Upvotes: 2
