Highlighting certain words in a Word document with Python

Question

I would like to highlight every occurrence of a given list of words in a Word document. However, I can only highlight the first word in a sentence...

Example: Suppose the document contains the lines below and I want to highlight the words Approximate, Pending and May. It works fine on the first four lines, but on the fifth line only "Pending" gets highlighted.

Approximate 
Pending spending 
May
May
Pending Approximate xx sit May

Here is my code, could you please help me out with this?

from docx import Document
from docx.enum.text import WD_COLOR_INDEX
import pandas as pd
import os
import re


path = r"C:\Users\files"
input_file = r"C:\Users\files\Dictionary.xlsx"


# List of words
df = pd.read_excel(input_file) 
my_list = df['dictionary'].tolist()
list_2 = [dic.capitalize() for dic in my_list]
list_3 = [dic.lower() for dic in my_list]
my_list.extend(list_2)
my_list.extend(list_3)


for filename in os.listdir(path):
    if filename.endswith(".docx"):
        file = os.path.join(path, filename)
        print(file)
        doc = Document(file)
        for para in doc.paragraphs:
            for phrase in my_list:
                #start = para.text.find(phrase)
                x = para.text
                starts = re.findall(r'\b' + phrase + r'\b', x)
                #print(start)
                if len(starts)>0:
                    #print(starts)
                    #if start > -1 :
                    start = para.text.find(phrase)
                    pre = para.text[:start]
                    post = para.text[start+len(phrase):]
                    para.text = pre
                    para.add_run(phrase)
                    para.runs[1].font.highlight_color = WD_COLOR_INDEX.YELLOW
                    para.add_run(post)
          
        doc.save( file )

Timus · Accepted Answer

I think I found the problem.

Here you assign plain, unformatted text to the paragraph, which discards any highlighting already applied in the pre part:

pre = para.text[:start]
...
para.text = pre

The same happens here for the post part:

post = para.text[start + len(phrase):]
...
para.add_run(post)

Each iteration therefore overwrites the highlighting from the previous iteration with plain text. The text property of a paragraph returns only a string, without any formatting. The documentation of the paragraph object's text property says:

Assigning text to this property causes all existing paragraph content to be replaced with a single run containing the assigned text ... Paragraph-level formatting, such as style, is preserved. All run-level formatting, such as bold or italic, is removed.

You can see this if you change the order of the elements in my_list. Only the last element will be highlighted. I started with

my_list = ['approximate', 'may', 'pending']

and the result was a highlighted Pending. Then I switched to

my_list = ['approximate', 'pending', 'may']

and the result was a highlighted May. I'm referring here to the last line in your example.
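
To make the order dependence easier to see, here is a toy model. It is not python-docx: the hypothetical ToyParagraph class only mimics the documented behaviour that assigning to text collapses everything into a single plain run, and highlight_like_question is a simplified copy of the loop in the question.

```python
class ToyParagraph:
    """Toy stand-in for a paragraph: a list of [text, highlighted] runs."""

    def __init__(self, text):
        self.runs = [[text, False]]

    @property
    def text(self):
        return "".join(t for t, _ in self.runs)

    @text.setter
    def text(self, value):
        # Mimics the documented behaviour: one plain run replaces everything.
        self.runs = [[value, False]]

    def add_run(self, text):
        self.runs.append([text, False])


def highlight_like_question(para, words):
    """Simplified version of the question's loop; returns the highlighted runs."""
    for phrase in words:
        start = para.text.find(phrase)
        if start == -1:
            continue
        pre = para.text[:start]
        post = para.text[start + len(phrase):]
        para.text = pre          # wipes every earlier highlight
        para.add_run(phrase)
        para.runs[1][1] = True   # highlight only the current phrase
        para.add_run(post)
    return [t for t, highlighted in para.runs if highlighted]


line = "Pending Approximate xx sit May"
print(highlight_like_question(ToyParagraph(line), ["Approximate", "May", "Pending"]))
# ['Pending']
print(highlight_like_question(ToyParagraph(line), ["Approximate", "Pending", "May"]))
# ['May']
```

In both runs only the word processed last keeps its highlight, which matches what I saw in the real document.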

EDIT: Here’s an attempt to fix it.

I’ve replaced

# List of words
df = pd.read_excel(input_file) 
my_list = df['dictionary'].tolist()
list_2 = [dic.capitalize() for dic in my_list]
list_3 = [dic.lower() for dic in my_list]
my_list.extend(list_2)
my_list.extend(list_3)

with

# List of words
df = pd.read_excel(input_file) 
my_list = df['dictionary'].tolist()

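# Setup regex: one case-insensitive alternation of whole-word patterns.
# re.escape keeps dictionary entries with regex metacharacters literal.
patterns = [r'\b' + re.escape(word) + r'\b' for word in my_list]
re_highlight = re.compile('(' + '|'.join(patterns) + ')+',
                          re.IGNORECASE)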

and

for filename in os.listdir(path):
    if filename.endswith(".docx"):
        file = os.path.join(path, filename)
        print(file)
        doc = Document(file)
        for para in doc.paragraphs:
            for phrase in my_list:
                #start = para.text.find(phrase)
                x = para.text
                starts = re.findall(r'\b' + phrase + r'\b', x)
                #print(start)
                if len(starts)>0:
                    #print(starts)
                    #if start > -1 :
                    start = para.text.find(phrase)
                    pre = para.text[:start]
                    post = para.text[start+len(phrase):]
                    para.text = pre
                    para.add_run(phrase)
                    para.runs[1].font.highlight_color = WD_COLOR_INDEX.YELLOW
                    para.add_run(post)
          
        doc.save( file )

with

for filename in os.listdir(path):
    if filename.endswith(".docx"):
        file = os.path.join(path, filename)
        print(file)
        doc = Document(file)
        for para in doc.paragraphs:
            text = para.text
            matches = list(re_highlight.finditer(text))
            if matches:
                # Rebuild the paragraph once: plain runs between the matches,
                # highlighted runs for the matches themselves.
                para.text = ''
                p3 = 0
                for match in matches:
                    p1 = p3
                    p2, p3 = match.span()
                    para.add_run(text[p1:p2])
                    run = para.add_run(text[p2:p3])
                    run.font.highlight_color = WD_COLOR_INDEX.YELLOW
                # Whatever follows the last match stays plain.
                para.add_run(text[p3:])
        doc.save(file)
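
The combined pattern can be checked on its own, without python-docx, against the sample lines from the question. This is only a sketch: the words list is my assumption about what the dictionary column contains, and the spans it prints are the offsets the loop above uses to slice the paragraph text into runs.

```python
import re

# Assumed contents of the 'dictionary' column (not taken from the real file).
words = ["approximate", "pending", "may"]

patterns = [r"\b" + re.escape(word) + r"\b" for word in words]
re_highlight = re.compile("(" + "|".join(patterns) + ")+", re.IGNORECASE)

sample_lines = [
    "Approximate",
    "Pending spending",
    "May",
    "Pending Approximate xx sit May",
]

for line in sample_lines:
    # match.span() gives the (start, end) offsets of each word to highlight.
    spans = [(m.group(0), m.span()) for m in re_highlight.finditer(line)]
    print(line, "->", spans)
```

Note that "spending" is not matched, because \b requires a word boundary before "pending".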

It worked for the sample you provided, but I’m not a regex wizard, so there may be a more concise solution.
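
One possibly shorter variant, offered as a sketch rather than a tested replacement: when the pattern has exactly one capturing group, re.split returns the plain pieces and the matched words alternately, so no manual span bookkeeping is needed. The helper names build_pattern and highlight_paragraph are hypothetical. The paragraph and the colour are passed in by the caller, so with python-docx you would call it as highlight_paragraph(para, pattern, WD_COLOR_INDEX.YELLOW).

```python
import re


def build_pattern(words):
    """Compile one case-insensitive, whole-word alternation (hypothetical helper)."""
    alternation = "|".join(re.escape(w) for w in words)
    # A single capturing group makes re.split keep the matched words.
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def highlight_paragraph(para, pattern, color):
    """Rebuild `para` so that every match becomes its own highlighted run.

    `para` is assumed to behave like a python-docx Paragraph (text, add_run)
    and `color` to be a value such as WD_COLOR_INDEX.YELLOW.
    """
    pieces = pattern.split(para.text)
    if len(pieces) == 1:  # no match: leave the paragraph untouched
        return
    para.text = ""
    for i, piece in enumerate(pieces):
        if not piece:
            continue
        run = para.add_run(piece)
        if i % 2 == 1:  # odd positions hold the captured matches
            run.font.highlight_color = color


pattern = build_pattern(["approximate", "pending", "may"])
print(pattern.split("Pending Approximate xx sit May"))
# ['', 'Pending', ' ', 'Approximate', ' xx sit ', 'May', '']
```

Like the loop above, this rebuilds every matching paragraph from its plain text, so any other run-level formatting in those paragraphs (bold, italic) is lost.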
