Reputation: 1072

Python docx Replace string in paragraph while keeping style

I need help replacing a string in a word document while keeping the formatting of the entire document.

I'm using python-docx, after reading the documentation, it works with entire paragraphs, so I loose formatting like words that are in bold or italics. Including the text to replace is in bold, and I would like to keep it that way. I'm using this code:

from docx import Document
def replace_string2(filename):
    doc = Document(filename)
    for p in doc.paragraphs:
        if 'Text to find and replace' in p.text:
            print 'SEARCH FOUND!!'
            text = p.text.replace('Text to find and replace', 'new text')
            style = p.style
            p.text = text
            p.style = style
    # doc.save(filename)
    doc.save('test.docx')
    return 1

So if I implement it and want something like (the paragraph containing the string to be replaced loses its formatting):

This is paragraph 1, and this is a text in bold.

This is paragraph 2, and I will replace old text

The current result is:

This is paragraph 1, and this is a text in bold.

This is paragraph 2, and I will replace new text

Upvotes: 22

Answers (10)

shawn caza

Reputation: 374

A lot of good ideas here. To add to what's already been said this post humbly offers A simple function to reduce the numbers of runs. Thereby reducing the chance the text you're searching for is broken up across multiple runs.

Note, it only accounts for the 5 styles specified. I also force colons at the start of a run into the previous run, and force everything between square brackets in a single run(relevant for my use case, but maybe not for yours).

def merge_superflous_runs(cell):
    """
        Where multiple consecutive runs have the same styles, combine them into one run.
        Also forces [square bracket content into a single run.]
        and moves any colons: at the start of a run to the end of the previous run.
    """
    for p in cell.paragraphs:
        # check styles of each run in p
        # If a run has the same styles as the previous run combine them
        previous_run_style = ''
        current_run_style = ''
        new_run_list = []

        for run in p.runs:

            if run.font.bold:
                current_run_style += 'bold'
            if run.font.italic:
                current_run_style += 'italic'
            if run.font.underline:
                current_run_style += 'underline'
            if run.font.superscript:
                current_run_style += 'superscript'
            if run.font.subscript:
                current_run_style += 'subscript'

            previous_run_open_bracket_pos = previous_run_style.rfind('[')
            previous_run_close_bracket_pos = previous_run_style.rfind(']')

            if len(new_run_list) and (current_run_style == previous_run_style or (previous_run_open_bracket_pos > -1 and previous_run_close_bracket_pos < previous_run_open_bracket_pos)):

                new_run_list[-1].text += run.text

            elif len(run.text.strip()) and run.text.strip()[0] == ':':
                new_run_list[-1].text += ':'
                colon_index = run.text.find(':')
                run.text = run.text[colon_index+1:]
                new_run_list.append(run)
                previous_run_style = current_run_style

            else:
                new_run_list.append(run)
                previous_run_style = current_run_style

            current_run_style = ''

            # remove current run
            run._element.getparent().remove(run._element)

        # add new runs
        for new_run in new_run_list:

            p._p.append(new_run._element)
    
    return cell

Upvotes: 0

Blitz Ritter

Reputation: 11

Try this link: https://thepythoncode.com/assistant/code-generator/python/

It generated a very similar code to the one you found out.

from docx import Document

def replace_word_in_docx(input_file, output_file, old_word, new_word):
"""
Replace a word in a Microsoft Word document while keeping the same 
styles.

Args:
input_file (str): Path to the input Word document.
output_file (str): Path to the output Word document.
old_word (str): The word to be replaced.
new_word (str): The new word to replace the old word.

Returns:
None
"""
doc = Document(input_file)

for paragraph in doc.paragraphs:
    if old_word in paragraph.text:
        inline = paragraph.runs
        for i in range(len(inline)):
            if old_word in inline[i].text:
                text = inline[i].text.replace(old_word, new_word)
                inline[i].text = text

doc.save(output_file)

Upvotes: 0

CG Zhang

Reputation: 314

According to the architecture of the DOCX document:

Text: docx>Paragraphs>runs>run
Text table: docx>tables>rows>cells>Paragraphs>runs>run
Header: docx>sections>header>Paragraphs>runs>run
Header tables: docx>sections>header>tables>row>cells>Paragraphs>runs>run

The footer is the same as the header, we can directly traverse the paragraph to find and replace our keywords, but this will cause the text format to be reset, so we can only traverse the words in the run and replace them. However, as our keywords may exceed the length range of the run, we cannot replace them successfully.

Therefore, I provide an idea here: firstly, take paragraph as unit, and mark the position of every character in paragraph through list; then, mark the position of every character in run through list; find keywords in paragraph, delete and replace them by character as unit by corresponding relation.

#!/usr/bin/env python 3.9
# -*- coding: utf-8 -*-
# @Time    : 2022/12/6 18:02 update
# @Author  : ZCG
# @File    : WordReplace.py
# @Software: PyCharm
# @Notice  :


from docx import Document
import os



class Execute:
    '''
        Execute Paragraphs KeyWords Replace
        paragraph: docx paragraph
    '''

    def __init__(self, paragraph):
        self.paragraph = paragraph


    def p_replace(self, x:int, key:str, value:str):
        '''
        paragraph replace
        The reason why you do not replace the text in a paragraph directly is that it will cause the original format to
        change. Replacing the text in runs will not cause the original format to change
        :param x:       paragraph id
        :param key:     Keywords that need to be replaced
        :param value:   The replaced keywords
        :return:
        '''
        # Gets the coordinate index values of all the characters in this paragraph [{run_index , char_index}]
        p_maps = [{"run": y, "char": z} for y, run in enumerate(self.paragraph.runs) for z, char in enumerate(list(run.text))]
        # Handle the number of times key occurs in this paragraph, and record the starting position in the list.
        # Here, while self.text.find(key) >= 0, the {"ab":"abc"} term will enter an endless loop
        # Takes a single paragraph as an independent body and gets an index list of key positions within the paragraph, or if the paragraph contains multiple keys, there are multiple index values
        k_idx = [s for s in range(len(self.paragraph.text)) if self.paragraph.text.find(key, s, len(self.paragraph.text)) == s]
        for i, start_idx in enumerate(reversed(k_idx)):       # Reverse order iteration
            end_idx = start_idx + len(key)                    # The end position of the keyword in this paragraph
            k_maps = p_maps[start_idx:end_idx]                # Map Slice List A list of dictionaries for sections that contain keywords in a paragraph
            self.r_replace(k_maps, value)
            print(f"\t |Paragraph {x+1: >3}, object {i+1: >3} replaced successfully! | {key} ===> {value}")


    def r_replace(self, k_maps:list, value:str):
        '''
        :param k_maps: The list of indexed dictionaries containing keywords， e.g:[{"run":15, "char":3},{"run":15, "char":4},{"run":16, "char":0}]
        :param value:
        :return:
        Accept arguments, removing the characters in k_maps from back to front, leaving the first one to replace with value
        Note: Must be removed in reverse order, otherwise the list length change will cause IndedxError: string index out of range
        '''
        for i, position in enumerate(reversed(k_maps), start=1):
            y, z = position["run"], position["char"]
            run:object = self.paragraph.runs[y]         # "k_maps" may contain multiple run ids, which need to be separated
            # Pit: Instead of the replace() method, str is converted to list after a single word to prevent run.text from making an error in some cases (e.g., a single run contains a duplicate word)
            thisrun = list(run.text)
            if i < len(k_maps):
                thisrun.pop(z)          # Deleting a corresponding word
            if i == len(k_maps):        # The last iteration (first word), that is, the number of iterations is equal to the length of k_maps
                thisrun[z] = value      # Replace the word in the corresponding position with the new content
            run.text = ''.join(thisrun) # Recover



class WordReplace:
    '''
        file: Microsoft Office word file，only support .docx type file
    '''

    def __init__(self, file):
        self.docx = Document(file)

    def body_content(self, replace_dict:dict):
        print("\t☺Processing keywords in the body...")
        for key, value in replace_dict.items():
            for x, paragraph in enumerate(self.docx.paragraphs):
                Execute(paragraph).p_replace(x, key, value)
        print("\t |Body keywords in the text are replaced!")


    def body_tables(self,replace_dict:dict):
        print("\t☺Processing keywords in the body'tables...")
        for key, value in replace_dict.items():
            for table in self.docx.tables:
                for row in table.rows:
                    for cell in row.cells:
                        for x, paragraph in enumerate(cell.paragraphs):
                            Execute(paragraph).p_replace(x, key, value)
        print("\t |Body'tables keywords in the text are replaced!")


    def header_content(self,replace_dict:dict):
        print("\t☺Processing keywords in the header'body ...")
        for key, value in replace_dict.items():
            for section in self.docx.sections:
                for x, paragraph in enumerate(section.header.paragraphs):
                    Execute(paragraph).p_replace(x, key, value)
        print("\t |Header'body keywords in the text are replaced!")


    def header_tables(self,replace_dict:dict):
        print("\t☺Processing keywords in the header'tables ...")
        for key, value in replace_dict.items():
            for section in self.docx.sections:
                for table in section.header.tables:
                    for row in table.rows:
                        for cell in row.cells:
                            for x, paragraph in enumerate(cell.paragraphs):
                                Execute(paragraph).p_replace(x, key, value)
        print("\t |Header'tables keywords in the text are replaced!")


    def footer_content(self, replace_dict:dict):
        print("\t☺Processing keywords in the footer'body ...")
        for key, value in replace_dict.items():
            for section in self.docx.sections:
                for x, paragraph in enumerate(section.footer.paragraphs):
                    Execute(paragraph).p_replace(x, key, value)
        print("\t |Footer'body keywords in the text are replaced!")


    def footer_tables(self, replace_dict:dict):
        print("\t☺Processing keywords in the footer'tables ...")
        for key, value in replace_dict.items():
            for section in self.docx.sections:
                for table in section.footer.tables:
                    for row in table.rows:
                        for cell in row.cells:
                            for x, paragraph in enumerate(cell.paragraphs):
                                Execute(paragraph).p_replace(x, key, value)
        print("\t |Footer'tables keywords in the text are replaced!")


    def save(self, filepath:str):
        '''
        :param filepath: File saving path
        :return:
        '''
        self.docx.save(filepath)


    @staticmethod
    def docx_list(dirPath):
        '''
        :param dirPath:
        :return: List of docx files in the current directory
        '''
        fileList = []
        for roots, dirs, files in os.walk(dirPath):
            for file in files:
                if file.endswith("docx") and file[0] != "~":  # Find the docx document and exclude temporary files
                    fileRoot = os.path.join(roots, file)
                    fileList.append(fileRoot)
        print("This directory finds a total of {0} related files!".format(len(fileList)))
        return fileList



def main():
    '''
    To use: Modify the values in replace dict and filedir
    replace_dict ：key:to be replaced, value:new content
    filedir ：Directory where docx files are stored. Subdirectories are supported
    '''
    # input section
    replace_dict = {
        "aaa":"bbb",
        "ccc":"ddd",
        }
    filedir = r"D:\Working Files\svn"

    # Call processing section
    for i, file in enumerate(WordReplace.docx_list(filedir),start=1):
        print(f"{i}、Processing file:{file}")
        wordreplace = WordReplace(file)
        wordreplace.header_content(replace_dict)
        wordreplace.header_tables(replace_dict)
        wordreplace.body_content(replace_dict)
        wordreplace.body_tables(replace_dict)
        wordreplace.footer_content(replace_dict)
        wordreplace.footer_tables(replace_dict)
        wordreplace.save(file)
        print(f'\t☻The document processing is complete!\n')



if __name__ == "__main__":
    main()
    print("All complete!")

Upvotes: 16

Gorgonzzalo

Reputation: 11

In case it serves to somebody, this worked for me. It does not keep the same format as in the opened document, but puts the format you want in case of replacing words going paragraph by paragraph:

from docx import Document
from docx.shared import Pt

word = "word to be replaced"
replacement = "word to replace"
document = Document()
style = document.styles['Normal']
font = style.font
font.name = 'Arial'
font.size = Pt(10)
for paragraph in document.paragraphs:
    if word in paragraph.text:
        paragraph.text = paragraph.text.replace(word,replacement)
        paragraph.style = document.styles['Normal']

Hope it serves!

Upvotes: 1

Henri

Reputation: 21

There's a new add-on module that accomplishes this exact functionality and it can be used with the docx library. When replacing strings, it keeps the original format of the text in the document.

Simply import docxedit and use the provided functions.

For example:

from docx import Document
import docxedit

document = Document()

# Replace all instances of the word 'Hello' in the document with 'Goodbye'
docxedit.replace_string(document, old_string='Hello', new_string='Goodbye')

# Replace all instances of the word 'Hello' in the document with 'Goodbye' but only
# up to paragraph 10
docxedit.replace_string_up_to_paragraph(document, old_string='Hello', new_string='Goodbye', 
                                        paragraph_number=10)

# Remove any line that contains the word 'Hello' along with the next 5 lines
docxedit.remove_lines(document, first_line='Hello', number_of_lines=5)

See more documentation here: https://github.com/henrihapponen/docxedit

Upvotes: 0

Ivan Bicalho

Reputation: 83

After analyzing a lot of suggestions and solutions across the web, I developed a python package based on all information I could find, plus a new strategy to keep all the format you have.

You can find more information at: https://github.com/ivanbicalho/python-docx-replace

Or, if you prefer, just install the package and use it, here's an example:

pip3 install python-docx-replace

from python_docx_replace.docx_replace import docx_replace

# get your document using python-docx
doc = Document("document.docx")

# call the replace function with your key value pairs
docx_replace(doc, name="Ivan", phone="+55123456789")

# do whatever you want after that, usually save the document
doc.save("replaced.docx")

Upvotes: 6

hmsy39

Reputation: 11

https://gist.github.com/heimoshuiyu/671a4dfbd13f7c279e85224a5b6726c0

This use a "shuttle" so it can find the key which cross multiple runs. This is similar to the "Replace All" behavior in MS Word

def shuttle_text(shuttle):
    t = ''
    for i in shuttle:
        t += i.text
    return t

def docx_replace(doc, data):
    for key in data:

        for table in doc.tables:
            for row in table.rows:
                for cell in row.cells:
                    if key in cell.text:
                        cell.text = cell.text.replace(key, data[key])

        for p in doc.paragraphs:

            begin = 0
            for end in range(len(p.runs)):

                shuttle = p.runs[begin:end+1]

                full_text = shuttle_text(shuttle)
                if key in full_text:
                    # print('Replace：', key, '->', data[key])
                    # print([i.text for i in shuttle])

                    # find the begin
                    index = full_text.index(key)
                    # print('full_text length', len(full_text), 'index:', index)
                    while index >= len(p.runs[begin].text):
                        index -= len(p.runs[begin].text)
                        begin += 1

                    shuttle = p.runs[begin:end+1]

                    # do replace
                    # print('before replace', [i.text for i in shuttle])
                    if key in shuttle[0].text:
                        shuttle[0].text = shuttle[0].text.replace(key, data[key])
                    else:
                        replace_begin_index = shuttle_text(shuttle).index(key)
                        replace_end_index = replace_begin_index + len(key)
                        replace_end_index_in_last_run = replace_end_index - len(shuttle_text(shuttle[:-1]))
                        shuttle[0].text = shuttle[0].text[:replace_begin_index] + data[key]

                        # clear middle runs
                        for i in shuttle[1:-1]:
                            i.text = ''

                        # keep last run
                        shuttle[-1].text = shuttle[-1].text[replace_end_index_in_last_run:]

                    # print('after replace', [i.text for i in shuttle])

                    # set begin to next
                    begin = end

# usage

doc = docx.Document('path/to/template.docx')
docx_replace(doc, dict(ItemOne='replacement text', ItemTwo="Some replacement text\nand some more")
doc.save('path/to/destination.docx')

Upvotes: 1

zain

Reputation: 49

from docx import Document

document = Document('old.docx')

dic = {
    '{{FULLNAME}}':'First Last',
    '{{FIRST}}':'First',
    '{{LAST}}' : 'Last',
}
for p in document.paragraphs:
    inline = p.runs
    for i in range(len(inline)):
        text = inline[i].text
        for key in dic.keys():
            if key in text:
                 text=text.replace(key,dic[key])
                 inline[i].text = text


document.save('new.docx')

Upvotes: 4

adejones

Reputation: 1060

This is what works for me to retain the text style when replacing text.

Based on Alo's answer and the fact the search text can be split over several runs, here's what worked for me to replace placeholder text in a template docx file. It checks all the document paragraphs and any table cell contents for the placeholders.

Once the search text is found in a paragraph it loops through it's runs identifying which runs contains the partial text of the search text, after which it inserts the replacement text in the first run then blanks out the remaining search text characters in the remaining runs.

I hope this helps someone. Here's the gist if anyone wants to improve it

Edit: I have subsequently discovered python-docx-template which allows jinja2 style templating within a docx template. Here's a link to the documentation

python3 python-docx python-docx-template

def docx_replace(doc, data):
    paragraphs = list(doc.paragraphs)
    for t in doc.tables:
        for row in t.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    paragraphs.append(paragraph)
    for p in paragraphs:
        for key, val in data.items():
            key_name = '${{{}}}'.format(key) # I'm using placeholders in the form ${PlaceholderName}
            if key_name in p.text:
                inline = p.runs
                # Replace strings and retain the same style.
                # The text to be replaced can be split over several runs so
                # search through, identify which runs need to have text replaced
                # then replace the text in those identified
                started = False
                key_index = 0
                # found_runs is a list of (inline index, index of match, length of match)
                found_runs = list()
                found_all = False
                replace_done = False
                for i in range(len(inline)):

                    # case 1: found in single run so short circuit the replace
                    if key_name in inline[i].text and not started:
                        found_runs.append((i, inline[i].text.find(key_name), len(key_name)))
                        text = inline[i].text.replace(key_name, str(val))
                        inline[i].text = text
                        replace_done = True
                        found_all = True
                        break

                    if key_name[key_index] not in inline[i].text and not started:
                        # keep looking ...
                        continue

                    # case 2: search for partial text, find first run
                    if key_name[key_index] in inline[i].text and inline[i].text[-1] in key_name and not started:
                        # check sequence
                        start_index = inline[i].text.find(key_name[key_index])
                        check_length = len(inline[i].text)
                        for text_index in range(start_index, check_length):
                            if inline[i].text[text_index] != key_name[key_index]:
                                # no match so must be false positive
                                break
                        if key_index == 0:
                            started = True
                        chars_found = check_length - start_index
                        key_index += chars_found
                        found_runs.append((i, start_index, chars_found))
                        if key_index != len(key_name):
                            continue
                        else:
                            # found all chars in key_name
                            found_all = True
                            break

                    # case 2: search for partial text, find subsequent run
                    if key_name[key_index] in inline[i].text and started and not found_all:
                        # check sequence
                        chars_found = 0
                        check_length = len(inline[i].text)
                        for text_index in range(0, check_length):
                            if inline[i].text[text_index] == key_name[key_index]:
                                key_index += 1
                                chars_found += 1
                            else:
                                break
                        # no match so must be end
                        found_runs.append((i, 0, chars_found))
                        if key_index == len(key_name):
                            found_all = True
                            break

                if found_all and not replace_done:
                    for i, item in enumerate(found_runs):
                        index, start, length = [t for t in item]
                        if i == 0:
                            text = inline[index].text.replace(inline[index].text[start:start + length], str(val))
                            inline[index].text = text
                        else:
                            text = inline[index].text.replace(inline[index].text[start:start + length], '')
                            inline[index].text = text
                # print(p.text)

# usage

doc = docx.Document('path/to/template.docx')
docx_replace(doc, dict(ItemOne='replacement text', ItemTwo="Some replacement text\nand some more")
doc.save('path/to/destination.docx')

Upvotes: 12

Alo

Reputation: 1072

I posted this question (even though I saw a few identical ones on here), because none of those (to my knowledge) solved the issue. There was one using a oodocx library, which I tried, but did not work. So I found a workaround.

The code is very similar, but the logic is: when I find the paragraph that contains the string I wish to replace, add another loop using runs. (this will only work if the string I wish to replace has the same formatting).

def replace_string(filename):
    doc = Document(filename)
    for p in doc.paragraphs:
        if 'old text' in p.text:
            inline = p.runs
            # Loop added to work with runs (strings with same style)
            for i in range(len(inline)):
                if 'old text' in inline[i].text:
                    text = inline[i].text.replace('old text', 'new text')
                    inline[i].text = text
            print p.text

    doc.save('dest1.docx')
    return 1

Upvotes: 28

Python docx Replace string in paragraph while keeping style

Answers (10)

Related Questions