Reputation: 2296
I would like to highlight all the given words in a given Word document; however, I can only highlight the first word in a sentence.
Example: Let's assume that I have the following text in a Word document, and I would like to highlight the following words: Approximate, Pending, May. It works fine for the first four lines, but on the fifth line only "Pending" gets highlighted.
Approximate
Pending spending
May
May
Pending Approximate xx sit May
Here is my code. Could you please help me out with this?
from docx import Document
from docx.enum.text import WD_COLOR_INDEX
import pandas as pd
import os
import re
path = r"C:\\Users\\files\\"
input_file = r"C:\\Users\\files\\\Dictionary.xlsx"
# List of words
df = pd.read_excel(input_file)
my_list = df['dictionary'].tolist()
list_2 = [dic.capitalize() for dic in my_list]
list_3 = [dic.lower() for dic in my_list]
my_list.extend(list_2)
my_list.extend(list_3)
for filename in os.listdir(path):
    if filename.endswith(".docx"):
        file = "C:\\Users\\files\\" + filename
        print(file)
        doc = Document(file)
        for para in doc.paragraphs:
            for phrase in my_list:
                #start = para.text.find(phrase)
                x = para.text
                starts = re.findall('\\b' + phrase + '\\b', x)
                #print(start)
                if len(starts)>0:
                    #print(starts)
                    #if start > -1 :
                    start = para.text.find(phrase)
                    pre = para.text[:start]
                    post = para.text[start+len(phrase):]
                    para.text = pre
                    para.add_run(phrase)
                    para.runs[1].font.highlight_color = WD_COLOR_INDEX.YELLOW
                    para.add_run(post)
        doc.save( file )
Upvotes: 1
Views: 1278
Reputation: 11321
I think I found the problem.
Here you insert plain text without any formatting into the paragraph, which overrides any formatting done previously in the pre part:
pre = para.text[:start]
...
para.text = pre
And the same happens here for the post part:
post = para.text[start + len(phrase):]
...
para.add_run(post)
It always overrides any highlighting done in the previous iteration with plain text. The text property of a paragraph gives you only a string without the formatting. And from the documentation of the text property of the Paragraph object:
Assigning text to this property causes all existing paragraph content to be replaced with a single run containing the assigned text ... Paragraph-level formatting, such as style, is preserved. All run-level formatting, such as bold or italic, is removed.
You can see this if you change the order of the elements in my_list. Only the last element will be highlighted. I started with
my_list = ['approximate', 'may', 'pending']
and the result was a highlighted Pending. Then I switched to
my_list = ['approximate', 'pending', 'may']
and the result was a highlighted May. I'm referring here to the last line in your example.
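The effect can be reproduced without python-docx by modelling a paragraph as a list of (text, highlighted) runs. This is only a toy model, assuming the behaviour quoted from the documentation above; highlight_like_question is a hypothetical name, not part of any library, and the capitalised forms of the words are passed in directly for brevity.

```python
def highlight_like_question(paragraph_text, phrases):
    """Toy model of the loop in the question (not python-docx itself).

    A paragraph is a list of (text, highlighted) runs. Reading the text joins
    the runs and drops the formatting; assigning the text replaces all runs
    with a single unformatted run.
    """
    runs = [(paragraph_text, False)]
    for phrase in phrases:
        # Like reading para.text: formatting is not part of the string.
        text = "".join(chunk for chunk, _ in runs)
        start = text.find(phrase)
        if start == -1:
            continue
        pre = text[:start]
        post = text[start + len(phrase):]
        runs = [(pre, False)]          # like para.text = pre (old runs are gone)
        runs.append((phrase, True))    # like add_run(phrase) plus highlight
        runs.append((post, False))     # like add_run(post)
    # Report which chunks are still highlighted at the end.
    return [chunk for chunk, highlighted in runs if highlighted]


line = "Pending Approximate xx sit May"
print(highlight_like_question(line, ["Approximate", "May", "Pending"]))  # ['Pending']
print(highlight_like_question(line, ["Approximate", "Pending", "May"]))  # ['May']
```

Each pass rebuilds the runs from the plain string, so only the phrase handled last keeps its highlight.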
EDIT: Here’s an attempt to fix it.
I’ve replaced
# List of words
df = pd.read_excel(input_file)
my_list = df['dictionary'].tolist()
list_2 = [dic.capitalize() for dic in my_list]
list_3 = [dic.lower() for dic in my_list]
my_list.extend(list_2)
my_list.extend(list_3)
with
# List of words
df = pd.read_excel(input_file)
my_list = df['dictionary'].tolist()
# Setup regex
patterns = [r'\b' + word + r'\b' for word in my_list]
re_highlight = re.compile('(' + '|'.join(patterns) + ')+',
                          re.IGNORECASE)
and
for filename in os.listdir(path):
    if filename.endswith(".docx"):
        file = "C:\\Users\\files\\" + filename
        print(file)
        doc = Document(file)
        for para in doc.paragraphs:
            for phrase in my_list:
                #start = para.text.find(phrase)
                x = para.text
                starts = re.findall('\\b' + phrase + '\\b', x)
                #print(start)
                if len(starts)>0:
                    #print(starts)
                    #if start > -1 :
                    start = para.text.find(phrase)
                    pre = para.text[:start]
                    post = para.text[start+len(phrase):]
                    para.text = pre
                    para.add_run(phrase)
                    para.runs[1].font.highlight_color = WD_COLOR_INDEX.YELLOW
                    para.add_run(post)
        doc.save( file )
with
for filename in os.listdir(path):
    if filename.endswith(".docx"):
        file = "C:\\Users\\files\\" + filename
        print(file)
        doc = Document(file)
        for para in doc.paragraphs:
            text = para.text
            # Only rebuild paragraphs that contain at least one match
            if re_highlight.search(text):
                matches = re_highlight.finditer(text)
                para.text = ''
                p3 = 0
                for match in matches:
                    p1 = p3
                    p2, p3 = match.span()
                    # Plain text between the previous match and this one
                    para.add_run(text[p1:p2])
                    # The matched word, highlighted
                    run = para.add_run(text[p2:p3])
                    run.font.highlight_color = WD_COLOR_INDEX.YELLOW
                # Remaining plain text after the last match
                para.add_run(text[p3:])
        doc.save(file)
It worked for the sample you provided. But I’m not a regex wiz; there might be a more concise solution.
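One possible tightening, sketched here without python-docx so that it can be run on its own: escape each word with re.escape (in case a dictionary entry contains regex metacharacters) and split the text into (chunk, is_match) pairs before touching the paragraph. build_pattern and split_segments are hypothetical helper names, not part of python-docx, and the sketch assumes the word list is non-empty and contains only strings.

```python
import re


def build_pattern(words):
    """Compile one case-insensitive pattern matching any whole word in `words`."""
    alternatives = "|".join(re.escape(word) for word in words)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def split_segments(text, pattern):
    """Return (chunk, is_match) pairs that together cover all of `text`."""
    segments = []
    pos = 0
    for match in pattern.finditer(text):
        start, end = match.span()
        if start > pos:
            # Plain text between the previous match and this one
            segments.append((text[pos:start], False))
        segments.append((text[start:end], True))
        pos = end
    if pos < len(text):
        # Plain text after the last match
        segments.append((text[pos:], False))
    return segments


pattern = build_pattern(["approximate", "pending", "may"])
for chunk, is_match in split_segments("Pending Approximate xx sit May", pattern):
    print(repr(chunk), is_match)
```

Inside the paragraph loop, each pair would then map to one para.add_run(chunk) call, with run.font.highlight_color set only when is_match is true. Like the version above, this still discards any run-level formatting the paragraph had before.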
Upvotes: 1