Reputation: 1072
I need help replacing a string in a word document while keeping the formatting of the entire document.
I'm using python-docx, after reading the documentation, it works with entire paragraphs, so I loose formatting like words that are in bold or italics. Including the text to replace is in bold, and I would like to keep it that way. I'm using this code:
from docx import Document
def replace_string2(filename):
doc = Document(filename)
for p in doc.paragraphs:
if 'Text to find and replace' in p.text:
print 'SEARCH FOUND!!'
text = p.text.replace('Text to find and replace', 'new text')
style = p.style
p.text = text
p.style = style
# doc.save(filename)
doc.save('test.docx')
return 1
So if I implement it and want something like (the paragraph containing the string to be replaced loses its formatting):
This is paragraph 1, and this is a text in bold.
This is paragraph 2, and I will replace old text
The current result is:
This is paragraph 1, and this is a text in bold.
This is paragraph 2, and I will replace new text
Upvotes: 22
Views: 63384
Reputation: 374
A lot of good ideas here. To add to what's already been said this post humbly offers A simple function to reduce the numbers of runs. Thereby reducing the chance the text you're searching for is broken up across multiple runs.
Note, it only accounts for the 5 styles specified. I also force colons at the start of a run into the previous run, and force everything between square brackets in a single run(relevant for my use case, but maybe not for yours).
def merge_superflous_runs(cell):
"""
Where multiple consecutive runs have the same styles, combine them into one run.
Also forces [square bracket content into a single run.]
and moves any colons: at the start of a run to the end of the previous run.
"""
for p in cell.paragraphs:
# check styles of each run in p
# If a run has the same styles as the previous run combine them
previous_run_style = ''
current_run_style = ''
new_run_list = []
for run in p.runs:
if run.font.bold:
current_run_style += 'bold'
if run.font.italic:
current_run_style += 'italic'
if run.font.underline:
current_run_style += 'underline'
if run.font.superscript:
current_run_style += 'superscript'
if run.font.subscript:
current_run_style += 'subscript'
previous_run_open_bracket_pos = previous_run_style.rfind('[')
previous_run_close_bracket_pos = previous_run_style.rfind(']')
if len(new_run_list) and (current_run_style == previous_run_style or (previous_run_open_bracket_pos > -1 and previous_run_close_bracket_pos < previous_run_open_bracket_pos)):
new_run_list[-1].text += run.text
elif len(run.text.strip()) and run.text.strip()[0] == ':':
new_run_list[-1].text += ':'
colon_index = run.text.find(':')
run.text = run.text[colon_index+1:]
new_run_list.append(run)
previous_run_style = current_run_style
else:
new_run_list.append(run)
previous_run_style = current_run_style
current_run_style = ''
# remove current run
run._element.getparent().remove(run._element)
# add new runs
for new_run in new_run_list:
p._p.append(new_run._element)
return cell
Upvotes: 0
Reputation: 11
Try this link: https://thepythoncode.com/assistant/code-generator/python/
It generated a very similar code to the one you found out.
from docx import Document
def replace_word_in_docx(input_file, output_file, old_word, new_word):
"""
Replace a word in a Microsoft Word document while keeping the same
styles.
Args:
input_file (str): Path to the input Word document.
output_file (str): Path to the output Word document.
old_word (str): The word to be replaced.
new_word (str): The new word to replace the old word.
Returns:
None
"""
doc = Document(input_file)
for paragraph in doc.paragraphs:
if old_word in paragraph.text:
inline = paragraph.runs
for i in range(len(inline)):
if old_word in inline[i].text:
text = inline[i].text.replace(old_word, new_word)
inline[i].text = text
doc.save(output_file)
Upvotes: 0
Reputation: 314
According to the architecture of the DOCX document:
The footer is the same as the header, we can directly traverse the paragraph to find and replace our keywords, but this will cause the text format to be reset, so we can only traverse the words in the run and replace them. However, as our keywords may exceed the length range of the run, we cannot replace them successfully.
Therefore, I provide an idea here: firstly, take paragraph as unit, and mark the position of every character in paragraph through list; then, mark the position of every character in run through list; find keywords in paragraph, delete and replace them by character as unit by corresponding relation.
#!/usr/bin/env python 3.9
# -*- coding: utf-8 -*-
# @Time : 2022/12/6 18:02 update
# @Author : ZCG
# @File : WordReplace.py
# @Software: PyCharm
# @Notice :
from docx import Document
import os
class Execute:
'''
Execute Paragraphs KeyWords Replace
paragraph: docx paragraph
'''
def __init__(self, paragraph):
self.paragraph = paragraph
def p_replace(self, x:int, key:str, value:str):
'''
paragraph replace
The reason why you do not replace the text in a paragraph directly is that it will cause the original format to
change. Replacing the text in runs will not cause the original format to change
:param x: paragraph id
:param key: Keywords that need to be replaced
:param value: The replaced keywords
:return:
'''
# Gets the coordinate index values of all the characters in this paragraph [{run_index , char_index}]
p_maps = [{"run": y, "char": z} for y, run in enumerate(self.paragraph.runs) for z, char in enumerate(list(run.text))]
# Handle the number of times key occurs in this paragraph, and record the starting position in the list.
# Here, while self.text.find(key) >= 0, the {"ab":"abc"} term will enter an endless loop
# Takes a single paragraph as an independent body and gets an index list of key positions within the paragraph, or if the paragraph contains multiple keys, there are multiple index values
k_idx = [s for s in range(len(self.paragraph.text)) if self.paragraph.text.find(key, s, len(self.paragraph.text)) == s]
for i, start_idx in enumerate(reversed(k_idx)): # Reverse order iteration
end_idx = start_idx + len(key) # The end position of the keyword in this paragraph
k_maps = p_maps[start_idx:end_idx] # Map Slice List A list of dictionaries for sections that contain keywords in a paragraph
self.r_replace(k_maps, value)
print(f"\t |Paragraph {x+1: >3}, object {i+1: >3} replaced successfully! | {key} ===> {value}")
def r_replace(self, k_maps:list, value:str):
'''
:param k_maps: The list of indexed dictionaries containing keywords, e.g:[{"run":15, "char":3},{"run":15, "char":4},{"run":16, "char":0}]
:param value:
:return:
Accept arguments, removing the characters in k_maps from back to front, leaving the first one to replace with value
Note: Must be removed in reverse order, otherwise the list length change will cause IndedxError: string index out of range
'''
for i, position in enumerate(reversed(k_maps), start=1):
y, z = position["run"], position["char"]
run:object = self.paragraph.runs[y] # "k_maps" may contain multiple run ids, which need to be separated
# Pit: Instead of the replace() method, str is converted to list after a single word to prevent run.text from making an error in some cases (e.g., a single run contains a duplicate word)
thisrun = list(run.text)
if i < len(k_maps):
thisrun.pop(z) # Deleting a corresponding word
if i == len(k_maps): # The last iteration (first word), that is, the number of iterations is equal to the length of k_maps
thisrun[z] = value # Replace the word in the corresponding position with the new content
run.text = ''.join(thisrun) # Recover
class WordReplace:
'''
file: Microsoft Office word file,only support .docx type file
'''
def __init__(self, file):
self.docx = Document(file)
def body_content(self, replace_dict:dict):
print("\t☺Processing keywords in the body...")
for key, value in replace_dict.items():
for x, paragraph in enumerate(self.docx.paragraphs):
Execute(paragraph).p_replace(x, key, value)
print("\t |Body keywords in the text are replaced!")
def body_tables(self,replace_dict:dict):
print("\t☺Processing keywords in the body'tables...")
for key, value in replace_dict.items():
for table in self.docx.tables:
for row in table.rows:
for cell in row.cells:
for x, paragraph in enumerate(cell.paragraphs):
Execute(paragraph).p_replace(x, key, value)
print("\t |Body'tables keywords in the text are replaced!")
def header_content(self,replace_dict:dict):
print("\t☺Processing keywords in the header'body ...")
for key, value in replace_dict.items():
for section in self.docx.sections:
for x, paragraph in enumerate(section.header.paragraphs):
Execute(paragraph).p_replace(x, key, value)
print("\t |Header'body keywords in the text are replaced!")
def header_tables(self,replace_dict:dict):
print("\t☺Processing keywords in the header'tables ...")
for key, value in replace_dict.items():
for section in self.docx.sections:
for table in section.header.tables:
for row in table.rows:
for cell in row.cells:
for x, paragraph in enumerate(cell.paragraphs):
Execute(paragraph).p_replace(x, key, value)
print("\t |Header'tables keywords in the text are replaced!")
def footer_content(self, replace_dict:dict):
print("\t☺Processing keywords in the footer'body ...")
for key, value in replace_dict.items():
for section in self.docx.sections:
for x, paragraph in enumerate(section.footer.paragraphs):
Execute(paragraph).p_replace(x, key, value)
print("\t |Footer'body keywords in the text are replaced!")
def footer_tables(self, replace_dict:dict):
print("\t☺Processing keywords in the footer'tables ...")
for key, value in replace_dict.items():
for section in self.docx.sections:
for table in section.footer.tables:
for row in table.rows:
for cell in row.cells:
for x, paragraph in enumerate(cell.paragraphs):
Execute(paragraph).p_replace(x, key, value)
print("\t |Footer'tables keywords in the text are replaced!")
def save(self, filepath:str):
'''
:param filepath: File saving path
:return:
'''
self.docx.save(filepath)
@staticmethod
def docx_list(dirPath):
'''
:param dirPath:
:return: List of docx files in the current directory
'''
fileList = []
for roots, dirs, files in os.walk(dirPath):
for file in files:
if file.endswith("docx") and file[0] != "~": # Find the docx document and exclude temporary files
fileRoot = os.path.join(roots, file)
fileList.append(fileRoot)
print("This directory finds a total of {0} related files!".format(len(fileList)))
return fileList
def main():
'''
To use: Modify the values in replace dict and filedir
replace_dict :key:to be replaced, value:new content
filedir :Directory where docx files are stored. Subdirectories are supported
'''
# input section
replace_dict = {
"aaa":"bbb",
"ccc":"ddd",
}
filedir = r"D:\Working Files\svn"
# Call processing section
for i, file in enumerate(WordReplace.docx_list(filedir),start=1):
print(f"{i}、Processing file:{file}")
wordreplace = WordReplace(file)
wordreplace.header_content(replace_dict)
wordreplace.header_tables(replace_dict)
wordreplace.body_content(replace_dict)
wordreplace.body_tables(replace_dict)
wordreplace.footer_content(replace_dict)
wordreplace.footer_tables(replace_dict)
wordreplace.save(file)
print(f'\t☻The document processing is complete!\n')
if __name__ == "__main__":
main()
print("All complete!")
Upvotes: 16
Reputation: 11
In case it serves to somebody, this worked for me. It does not keep the same format as in the opened document, but puts the format you want in case of replacing words going paragraph by paragraph:
from docx import Document
from docx.shared import Pt
word = "word to be replaced"
replacement = "word to replace"
document = Document()
style = document.styles['Normal']
font = style.font
font.name = 'Arial'
font.size = Pt(10)
for paragraph in document.paragraphs:
if word in paragraph.text:
paragraph.text = paragraph.text.replace(word,replacement)
paragraph.style = document.styles['Normal']
Hope it serves!
Upvotes: 1
Reputation: 11
There's a new add-on module that accomplishes this exact functionality and it can be used with the docx
library. When replacing strings, it keeps the original format of the text in the document.
Simply import docxedit
and use the provided functions.
For example:
from docx import Document
import docxedit
document = Document()
# Replace all instances of the word 'Hello' in the document with 'Goodbye'
docxedit.replace_string(document, old_string='Hello', new_string='Goodbye')
# Replace all instances of the word 'Hello' in the document with 'Goodbye' but only
# up to paragraph 10
docxedit.replace_string_up_to_paragraph(document, old_string='Hello', new_string='Goodbye',
paragraph_number=10)
# Remove any line that contains the word 'Hello' along with the next 5 lines
docxedit.remove_lines(document, first_line='Hello', number_of_lines=5)
See more documentation here: https://github.com/henrihapponen/docxedit
Upvotes: 0
Reputation: 73
After analyzing a lot of suggestions and solutions across the web, I developed a python package based on all information I could find, plus a new strategy to keep all the format you have.
You can find more information at: https://github.com/ivanbicalho/python-docx-replace
Or, if you prefer, just install the package and use it, here's an example:
pip3 install python-docx-replace
from python_docx_replace.docx_replace import docx_replace
# get your document using python-docx
doc = Document("document.docx")
# call the replace function with your key value pairs
docx_replace(doc, name="Ivan", phone="+55123456789")
# do whatever you want after that, usually save the document
doc.save("replaced.docx")
Upvotes: 5
Reputation: 11
https://gist.github.com/heimoshuiyu/671a4dfbd13f7c279e85224a5b6726c0
This use a "shuttle" so it can find the key which cross multiple runs. This is similar to the "Replace All" behavior in MS Word
def shuttle_text(shuttle):
t = ''
for i in shuttle:
t += i.text
return t
def docx_replace(doc, data):
for key in data:
for table in doc.tables:
for row in table.rows:
for cell in row.cells:
if key in cell.text:
cell.text = cell.text.replace(key, data[key])
for p in doc.paragraphs:
begin = 0
for end in range(len(p.runs)):
shuttle = p.runs[begin:end+1]
full_text = shuttle_text(shuttle)
if key in full_text:
# print('Replace:', key, '->', data[key])
# print([i.text for i in shuttle])
# find the begin
index = full_text.index(key)
# print('full_text length', len(full_text), 'index:', index)
while index >= len(p.runs[begin].text):
index -= len(p.runs[begin].text)
begin += 1
shuttle = p.runs[begin:end+1]
# do replace
# print('before replace', [i.text for i in shuttle])
if key in shuttle[0].text:
shuttle[0].text = shuttle[0].text.replace(key, data[key])
else:
replace_begin_index = shuttle_text(shuttle).index(key)
replace_end_index = replace_begin_index + len(key)
replace_end_index_in_last_run = replace_end_index - len(shuttle_text(shuttle[:-1]))
shuttle[0].text = shuttle[0].text[:replace_begin_index] + data[key]
# clear middle runs
for i in shuttle[1:-1]:
i.text = ''
# keep last run
shuttle[-1].text = shuttle[-1].text[replace_end_index_in_last_run:]
# print('after replace', [i.text for i in shuttle])
# set begin to next
begin = end
# usage
doc = docx.Document('path/to/template.docx')
docx_replace(doc, dict(ItemOne='replacement text', ItemTwo="Some replacement text\nand some more")
doc.save('path/to/destination.docx')
Upvotes: 1
Reputation: 49
from docx import Document
document = Document('old.docx')
dic = {
'{{FULLNAME}}':'First Last',
'{{FIRST}}':'First',
'{{LAST}}' : 'Last',
}
for p in document.paragraphs:
inline = p.runs
for i in range(len(inline)):
text = inline[i].text
for key in dic.keys():
if key in text:
text=text.replace(key,dic[key])
inline[i].text = text
document.save('new.docx')
Upvotes: 4
Reputation: 1050
This is what works for me to retain the text style when replacing text.
Based on Alo
's answer and the fact the search text can be split over several runs, here's what worked for me to replace placeholder text in a template docx file. It checks all the document paragraphs and any table cell contents for the placeholders.
Once the search text is found in a paragraph it loops through it's runs identifying which runs contains the partial text of the search text, after which it inserts the replacement text in the first run then blanks out the remaining search text characters in the remaining runs.
I hope this helps someone. Here's the gist if anyone wants to improve it
Edit:
I have subsequently discovered python-docx-template
which allows jinja2 style templating within a docx template. Here's a link to the documentation
python3 python-docx python-docx-template
def docx_replace(doc, data):
paragraphs = list(doc.paragraphs)
for t in doc.tables:
for row in t.rows:
for cell in row.cells:
for paragraph in cell.paragraphs:
paragraphs.append(paragraph)
for p in paragraphs:
for key, val in data.items():
key_name = '${{{}}}'.format(key) # I'm using placeholders in the form ${PlaceholderName}
if key_name in p.text:
inline = p.runs
# Replace strings and retain the same style.
# The text to be replaced can be split over several runs so
# search through, identify which runs need to have text replaced
# then replace the text in those identified
started = False
key_index = 0
# found_runs is a list of (inline index, index of match, length of match)
found_runs = list()
found_all = False
replace_done = False
for i in range(len(inline)):
# case 1: found in single run so short circuit the replace
if key_name in inline[i].text and not started:
found_runs.append((i, inline[i].text.find(key_name), len(key_name)))
text = inline[i].text.replace(key_name, str(val))
inline[i].text = text
replace_done = True
found_all = True
break
if key_name[key_index] not in inline[i].text and not started:
# keep looking ...
continue
# case 2: search for partial text, find first run
if key_name[key_index] in inline[i].text and inline[i].text[-1] in key_name and not started:
# check sequence
start_index = inline[i].text.find(key_name[key_index])
check_length = len(inline[i].text)
for text_index in range(start_index, check_length):
if inline[i].text[text_index] != key_name[key_index]:
# no match so must be false positive
break
if key_index == 0:
started = True
chars_found = check_length - start_index
key_index += chars_found
found_runs.append((i, start_index, chars_found))
if key_index != len(key_name):
continue
else:
# found all chars in key_name
found_all = True
break
# case 2: search for partial text, find subsequent run
if key_name[key_index] in inline[i].text and started and not found_all:
# check sequence
chars_found = 0
check_length = len(inline[i].text)
for text_index in range(0, check_length):
if inline[i].text[text_index] == key_name[key_index]:
key_index += 1
chars_found += 1
else:
break
# no match so must be end
found_runs.append((i, 0, chars_found))
if key_index == len(key_name):
found_all = True
break
if found_all and not replace_done:
for i, item in enumerate(found_runs):
index, start, length = [t for t in item]
if i == 0:
text = inline[index].text.replace(inline[index].text[start:start + length], str(val))
inline[index].text = text
else:
text = inline[index].text.replace(inline[index].text[start:start + length], '')
inline[index].text = text
# print(p.text)
# usage
doc = docx.Document('path/to/template.docx')
docx_replace(doc, dict(ItemOne='replacement text', ItemTwo="Some replacement text\nand some more")
doc.save('path/to/destination.docx')
Upvotes: 12
Reputation: 1072
I posted this question (even though I saw a few identical ones on here), because none of those (to my knowledge) solved the issue. There was one using a oodocx library, which I tried, but did not work. So I found a workaround.
The code is very similar, but the logic is: when I find the paragraph that contains the string I wish to replace, add another loop using runs. (this will only work if the string I wish to replace has the same formatting).
def replace_string(filename):
doc = Document(filename)
for p in doc.paragraphs:
if 'old text' in p.text:
inline = p.runs
# Loop added to work with runs (strings with same style)
for i in range(len(inline)):
if 'old text' in inline[i].text:
text = inline[i].text.replace('old text', 'new text')
inline[i].text = text
print p.text
doc.save('dest1.docx')
return 1
Upvotes: 28