Reputation: 122072
The input string is:
s = 'The\ue000 Project\ue000Gutenber g\ue000 E Book \ue000of\ue000 The\ue000 Ad vent ure s\ue000of\ue000 Sherlock\ue000 Holmes\n '
And the output string is:
o = 'The Project Gutenber_ _g E_ _Book of The Ad_ _vent_ _ure_ _s of Sherlock Holmes\n'
Note that in the input string, the \ue000 characters are the hard delimiters between words.
The aim is to do something like this:
1. Replace each space with '_ ' (the underscore represents a connection between two non-\ue000 characters).
2. If there are '_ \uE000' sequences, remove the underscore, since there is no connection between the character before the underscore and the next word (remember, \ue000 is a hard word delimiter).
3. Replace each \ue000 with a space, so we're left with underscores that are either attached to the last character of a word or hanging between two spaces.
4. Deduplicate the spaces.
5. Delete the underscores hanging between two spaces.
6. Now that the only underscores left are attached to the end of words, we can safely replace them with '_ _' to indicate that the two sub-words are combinable.
The full set of replacements in the order described above:
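text = text.replace(u'\n ', '\n')             # drop the space after each newline
text = text.replace(u' ', '_ ')               # mark every space as a possible join
text = text.replace(u'_ \uE000', u' \uE000')  # no join right before a hard delimiter
text = text.replace(u"\uE000", u' ')          # hard delimiters become plain spaces
text = text.replace(u'  ', u' ')              # deduplicate spaces (two spaces -> one)
text = text.replace(u' _ ', u' ')             # drop underscores hanging between spaces
text = text.replace(u'_ ', u'_ _')            # turn the remaining joins into '_ _'
text = text.replace(u'  ', u' ')              # deduplicate spaces again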
Note: the first replacement, text.replace(u'\n ', '\n'), is necessary because the string could be the contents of a full text file, and str.strip() alone would not remove the unnecessary spaces between each \n and the start of the next line.
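To illustrate the point about str.strip(), here is a small sketch using a made-up two-line string (the sample text is an assumption, not taken from the real file). strip() only trims the two ends of the string, so a space that follows an inner newline survives, while replace() removes it after every newline:

```python
# Hypothetical multi-line sample: each newline is followed by a stray space.
sample = 'first line\n second line\n '

# str.strip() only trims the two ends of the string,
# so the space after the inner newline is still there.
stripped = sample.strip()

# str.replace() removes the space after every newline, including the last one.
replaced = sample.replace('\n ', '\n')

print(repr(stripped))
print(repr(replaced))
```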
Is there a less convoluted way to achieve the same output string that keeps the logic of why the replacements are done in the way described above?
Upvotes: 0
Views: 498
Reputation: 59148
I don't quite follow your penultimate paragraph regarding newlines, but that aside, a single re.sub() is sufficient to get you most of the way:
>>> import re
>>>
>>> re.sub(r'[ \ue000]+', lambda m: ' ' if '\ue000' in m.group() else '_ _', s)
'The Project Gutenber_ _g E_ _Book of The Ad_ _vent_ _ure_ _s of Sherlock Holmes\n_ _'
This finds all sequences of \ue000 and spaces, then replaces those sequences using a lambda that returns either a space or '_ _' depending on whether the match contains \ue000.
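The same substitution can also be spelled out with a named replacement function, which may make the rule easier to read. This is only a sketch; the names DELIM, RUN, join_marker and mark_joins are made up for illustration:

```python
import re

DELIM = '\ue000'                        # hard word delimiter from the question
RUN = re.compile('[ ' + DELIM + ']+')   # one or more spaces and/or delimiters

def join_marker(match):
    """Return the replacement for one run of spaces and delimiters."""
    if DELIM in match.group():
        return ' '      # the run contains a hard delimiter: plain word break
    return '_ _'        # spaces only: the two sub-words are combinable

def mark_joins(text):
    """Apply join_marker to every run of spaces/delimiters in text."""
    return RUN.sub(join_marker, text)

print(mark_joins('Ad vent\ue000 ure s'))
```

Passing a function as the replacement argument is standard re.sub() behaviour: it is called once per match and its return value is used as the replacement string.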
After that, as far as I can tell (as I said, your penultimate paragraph is somewhat confusing), you just need to strip underscores and spaces:
>>> re.sub(r'[ \ue000]+', lambda m: ' ' if '\ue000' in m.group() else '_ _', s).strip('_ ')
'The Project Gutenber_ _g E_ _Book of The Ad_ _vent_ _ure_ _s of Sherlock Holmes\n'
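As a sanity check, the chain of str.replace() calls from the question and the re.sub() version can be compared on the sample input. This is a sketch under two assumptions: the deduplication steps replace two spaces with one (as the question's description says), and the helper names with_replaces and with_regex are hypothetical.

```python
import re

s = ('The\ue000 Project\ue000Gutenber g\ue000 E Book \ue000of\ue000 The'
     '\ue000 Ad vent ure s\ue000of\ue000 Sherlock\ue000 Holmes\n ')

TWO_SPACES = ' ' * 2  # written this way so the double space stays visible

def with_replaces(text):
    """The question's chain of replacements, in the same order."""
    text = text.replace('\n ', '\n')
    text = text.replace(' ', '_ ')
    text = text.replace('_ \ue000', ' \ue000')
    text = text.replace('\ue000', ' ')
    text = text.replace(TWO_SPACES, ' ')
    text = text.replace(' _ ', ' ')
    text = text.replace('_ ', '_ _')
    text = text.replace(TWO_SPACES, ' ')
    return text

def with_regex(text):
    """The single re.sub() call followed by strip('_ ')."""
    marked = re.sub('[ \ue000]+',
                    lambda m: ' ' if '\ue000' in m.group() else '_ _',
                    text)
    return marked.strip('_ ')

expected = ('The Project Gutenber_ _g E_ _Book of The Ad_ _vent_ _ure_ _s '
            'of Sherlock Holmes\n')
print(with_replaces(s) == expected, with_regex(s) == expected)
```

This only confirms agreement on this one sample. On multi-line text, a space right after an inner newline would become '_ _' in the regex version, so applying text.replace('\n ', '\n') first is probably still needed there.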
Upvotes: 2