Reputation: 1646
I've got a string (Python 2.7.3) which is rendered as a template in Django, but I don't think this is specific to Django. The string comes from the document.xml file inside a docx file. I'm extracting the document XML, rendering it, and putting it back inside the docx for some simple mail-merge type stuff.
One of the issues, other than the obvious limitations to what template tags I can use, is that Word likes to drop in a whole bunch of xml if you edit the text in Word.
For my needs, I'd be successful if I could find &quot; between double curly braces and replace it with a plain double quote ("). I'd like to replace the &quot; with " in something like the following:
word_docxml = 'some text here {{form.letterdate|date:&quot;Y-m-d&quot;}} and more text'
I was reading over these:
but having trouble putting it together.
How do I remove/strip everything inside and including the < > in between the {{ }}'s in a mess like the following:
<w:rPr>
<w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/>
<w:color w:val="00000A"/>
<w:sz w:val="22"/>
<w:szCs w:val="22"/>
<w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/>
</w:rPr>
<w:t>{{form.</w:t></w:r><w:r>
<w:rPr>
<w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/>
<w:b w:val="false"/>
<w:bCs w:val="false"/>
<w:color w:val="00000A"/>
<w:sz w:val="22"/>
<w:szCs w:val="22"/>
<w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/>
</w:rPr>
<w:t>L</w:t></w:r><w:r>
<w:rPr>
<w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/>
<w:color w:val="00000A"/>
<w:sz w:val="22"/>
<w:szCs w:val="22"/>
<w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/>
</w:rPr>
<w:t>etterDate.value|date:"Y-m-d"}}</w:t></w:r>
which would result in the following (apologies, I can't seem to highlight the area of interest):
<w:rPr>
<w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/>
<w:color w:val="00000A"/>
<w:sz w:val="22"/>
<w:szCs w:val="22"/>
<w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/>
</w:rPr>
<w:t>{{form.LetterDate.value|date:"Y-m-d"}}</w:t></w:r>
How does one handle this? Is regex the way to go; if so, how to put the command together?
This is not a duplicate of Between double curly braces: replace particular text because it has no mention of handling a double curly brace for start and end for the search range (that was my real problem, I've read through many examples and was unable to get the pattern for substitution formatted correctly). The other post is about parsing a subset of html entities in XHTML; there is no XHTML parsing required, mentioned or questioned in my post. This post here asks how to remove and/or replace a repeating pattern between two other known start/end patterns. I provided a brief background, two concrete examples from the simple to the complex hoping to learn how to accomplish my current task - my best hope was to get part A explained and apply the method myself to part B. I got intelligent discussion and super replies from helpful members of the community. My post doesn't involve HTML at all as the template I'm rendering in Django is added back to a docx archive and saved to a filestore. It is not a duplicate (of the marked duplicate anyhow).
Upvotes: 2
Views: 4439
Reputation: 2553
Yes, regex is great for this!
a) Use this:
re.sub(r"(\{\{[^}]+}\})", lambda m: re.sub("&quot;", '"', m.group(1)), word_docxml)
Results:
>>> import re
>>> word_docxml = 'some text here {{form.letterdate|date:&quot;Y-m-d&quot;}} and &quot; more text'
>>> re.sub(r"(\{\{[^}]+}\})", lambda m: re.sub("&quot;", '"', m.group(1)), word_docxml)
'some text here {{form.letterdate|date:"Y-m-d"}} and &quot; more text'
b) More of the same, just matching different content inside the braces;
re.sub(r"(\{\{[^}]+}\})", lambda m: re.sub("<[^>]+>", "", m.group(1)), s)
Results:
>>> s = """<w:rPr><w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/><w:color w:val="00000A"/><w:sz w:val="22"/><w:szCs w:val="22"/><w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/></w:rPr><w:t>{{form.</w:t></w:r><w:r><w:rPr><w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/><w:b w:val="false"/><w:bCs w:val="false"/><w:color w:val="00000A"/><w:sz w:val="22"/><w:szCs w:val="22"/><w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/></w:rPr><w:t>L</w:t></w:r><w:r><w:rPr><w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/><w:color w:val="00000A"/><w:sz w:val="22"/><w:szCs w:val="22"/><w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/></w:rPr><w:t>etterDate.value|date:"Y-m-d"}}</w:t></w:r>"""
>>> re.sub(r"(\{\{[^}]+}\})", lambda m: re.sub("<[^>]+>", "", m.group(1)), s)
'<w:rPr><w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/><w:color w:val="00000A"/><w:sz w:val="22"/><w:szCs w:val="22"/><w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/></w:rPr><w:t>{{form.LetterDate.value|date:"Y-m-d"}}</w:t></w:r>'
Explanation, since you asked for guidance, not just the answer;
re.sub(r"(\{\{[^}]+}\})", lambda m: re.sub("&quot;", '"', m.group(1)), word_docxml)
The way this works is to first match a double brace interval. The lambda expression just takes the group found in that match and does the replace of the relevant content.
The smaller regexes explained:
&quot; # Just matching that, nothing fancy
A pattern to match tags;
< # Opening bracket of a tag
[^>]+ # Followed by 1 or more characters that are not a closing bracket
> # Followed by the closing bracket
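Both replacements can be folded into one helper. The following is only a minimal sketch of that idea, not part of the original answer: the names clean_template_tags, TAG_RE and XML_RE are made up for illustration, and it assumes the template variable never contains a single } before the closing }}.

```python
import re

# Matches one {{ ... }} template variable; [^}]+ also spans newlines.
TAG_RE = re.compile(r"\{\{[^}]+\}\}")
# Matches a single XML tag such as <w:t> or </w:r>.
XML_RE = re.compile(r"<[^>]+>")


def clean_template_tags(xml):
    """Strip XML tags and unescape &quot; inside {{ ... }} only."""
    def fix(match):
        # Pass 1: drop every tag Word injected inside the braces.
        inner = XML_RE.sub("", match.group(0))
        # Pass 2: turn the escaped quotes back into real ones.
        return inner.replace("&quot;", '"')
    return TAG_RE.sub(fix, xml)


sample = ('<w:t>{{form.</w:t></w:r><w:r><w:t>L</w:t></w:r><w:r>'
          '<w:t>etterDate.value|date:&quot;Y-m-d&quot;}}</w:t> &quot;')
cleaned = clean_template_tags(sample)
print(cleaned)
# <w:t>{{form.LetterDate.value|date:"Y-m-d"}}</w:t> &quot;
```

The &quot; outside the braces is left alone, which matters because a bare " in element text is fine but the rest of document.xml should not be touched more than necessary. This sketch does not handle {% %} block tags.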
Upvotes: 1
Reputation: 14079
One must be careful when testing a regex that it doesn't match too much (false positives). Given your complex input this becomes more important.
For example, a regex should not match the &quot; in either of the lines below:
test { &quot; }}text
test &quot; }}
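A quick way to guard against such false positives is to keep those inputs as a small regression check. This is a sketch under assumptions: unescape_in_braces and BRACES are hypothetical names, and the brace pattern is one possible choice rather than the only one.

```python
import re

# Requires a real {{ opener, so a lone { or a stray }} never matches.
BRACES = re.compile(r"\{\{[^}]+\}\}")


def unescape_in_braces(text):
    """Replace &quot; with a double quote, but only inside {{ ... }}."""
    return BRACES.sub(lambda m: m.group(0).replace("&quot;", '"'), text)


# Inputs that must come back unchanged (no proper {{ ... }} pair).
negatives = ["test { &quot; }}text", "test &quot; }}"]
unchanged = [unescape_in_braces(line) == line for line in negatives]

# An input where only the &quot; inside the braces should change.
positive = unescape_in_braces("{{ a|date:&quot;Y&quot; }} &quot;")
print(unchanged, positive)
```

If a later tweak to the pattern starts touching the negative cases, the check flags it before it reaches a real document.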
As for your second question I would do it in 2 passes to keep the regex nice 'n simple
First use this regex to match content between {{ and }}
\{\{(.*?)\}\}
Now apply a function to only the contents of group 1. I am familiar with .NET which allows this and I hope your language does too
The function to apply is again a regex replacement, this time replacing each match with nothing:
<[^>]*>
I hope I got the Python dialect right.
The first question can use the same idea.
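In Python, the two-pass idea maps onto re.sub with a function as the replacement argument. The following is a rough sketch only: strip_tags_in_braces is a made-up name, and re.DOTALL is an assumption added so that . also matches newlines in case the XML is pretty-printed.

```python
import re

# Pass 1: non-greedy match of everything between {{ and }}.
BRACES = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)
# Pass 2: any XML tag, applied only to group 1 of the first match.
TAGS = re.compile(r"<[^>]*>")


def strip_tags_in_braces(xml):
    def second_pass(match):
        # Rebuild the braces around the cleaned-up group 1.
        return "{{" + TAGS.sub("", match.group(1)) + "}}"
    return BRACES.sub(second_pass, xml)


doc = ('<w:t>{{form.</w:t></w:r><w:r><w:t>LetterDate.value}}</w:t>'
       ' <w:t>plain</w:t>')
result = strip_tags_in_braces(doc)
print(result)
# <w:t>{{form.LetterDate.value}}</w:t> <w:t>plain</w:t>
```

Note that any newlines sitting between the stripped tags stay in place, so multi-line input may need those removed separately before the template engine will recognise the variable.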
Upvotes: 0