Reputation: 950
I have a string like this:
import re
text = """
Some stuff to keep <b>here</b>
CODE
<b>Replace gt and lt</b>
<i>inside <script>this</script> code</i>
CODE
Some more stuff to keep <b>here</b>
"""
And the expected output is:
Some stuff to keep <b>here</b>
CODE
_LT_b_GT_Replace gt and lt_LT_/b_GT_
_LT_i_GT_inside _LT_script_GT_this_LT_/script_GT_ code_LT_/i_GT_
CODE
Some more stuff to keep <b>here</b>
Here's a small subset of what I've tried:
# None of these work, and typically only replace the first or last occurence of <
re.sub(r'(?<=CODE)<(?=CODE)', r'_LT_', text, flags=re.DOTALL)
re.sub(r'(?<=CODE)(.*?)<(.*?)(?=CODE)', r'\1_LT_\2', text, flags=re.DOTALL)
re.sub(r'(?<=CODE)(.*?)[<]*(.*?)(?=CODE)', r'\1_LT_\2', text, flags=re.DOTALL|re.MULTILINE)
re.sub(r'(CODE.*?)<(.*?CODE)', r'\1_LT_\2', text, flags=re.DOTALL)
re.sub(r'(CODE.*)<(.*CODE)', r'\1_LT_\2', text, flags=re.DOTALL)
What I'd like to happen: All occurrences of <
between CODE
and CODE
to be replaced with _LT_
.
After spending the day on stackoverflow and regex101.com, I'm starting to think either it's not possible or I'm not smart enough to handle this.
Any help is tremendously appreciated!
Thanks in advance.
Upvotes: 1
Views: 91
Reputation: 18641
With regex:
import re
text = "\nSome stuff to keep <b>here</b>\n\nCODE\n<b>Replace gt and lt</b>\n<i>inside <script>this</script> code</i>\nCODE\n\nSome more stuff to keep <b>here</b>\n"
pattern = r"(?s)CODE.*?CODE"
print(re.sub(pattern, lambda x: x.group().replace('<','_LT_').replace('>','_GT_'), text))
See Python proof.
Results:
Some stuff to keep <b>here</b>
CODE
_LT_b_GT_Replace gt and lt_LT_/b_GT_
_LT_i_GT_inside _LT_script_GT_this_LT_/script_GT_ code_LT_/i_GT_
CODE
Some more stuff to keep <b>here</b>
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?s) set flags for this block (with . matching
\n) (case-sensitive) (with ^ and $
matching normally) (matching whitespace
and # normally)
--------------------------------------------------------------------------------
CODE 'CODE'
--------------------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
--------------------------------------------------------------------------------
CODE 'CODE'
Upvotes: 2
Reputation: 34
I'll update this answer in a few minutes with an only-regex solution but, meanwhile... Is not doing a split and then join strings a solution?
re.sub(regex, value, text.split("CODE\n")[1], flags)
EDIT! I found the answer, but it's a little bit hacky You can read the full description in this post: https://stackoverflow.com/a/11096811/8665327
Basically, the line you are looking for is this:
text = re.sub('\nCODE\n[^(CODE)]*\nCODE\n', lambda x: x.group(0).replace('<', '_LT_').replace('>', '_GT_'), text)
This will work with the first set of text placed between "CODE" text in its own line as long as there is no "CODE" string between them
will_work = """
<title>This will work</title>
CODE
<b>Replace this</b>
CODE
"""
wont_work = """
CODE
<b>This won't work</b>CODE
CODE
"""
Upvotes: 0
Reputation: 431
Here is my answer:
text = """
Some stuff to keep <b>here</b>
CODE
<b>Replace gt and lt</b>
<i>inside <script>this</script> code</i>
CODE
Some more stuff to keep <b>here</b>
"""
output = ''
for i in range(len(text.split('CODE'))):
if i % 2:
output += text.split('CODE')[i].replace('>', '_GT_').replace('<', '_LT_')
else:
output += text.split('CODE')[i]
print(output)
With this solution, every code block is being formated and added to the output
. This does not include regex
but this works.
Upvotes: 3