Reputation: 6967
I'm trying to detect text that has a coloured background in an MS Word docx, to separate it from the "normal" text.
from docx import Document
...
# NOTE: the original "def" line was elided; this function name is illustrative
def find_highlighted_text(docx_path):
    # Load the document
    doc = Document(docx_path)
    highlighted_text = []
    normal_text = []
    # Iterate through all paragraphs
    for para in doc.paragraphs:
        # Iterate through all runs in the paragraph
        for run in para.runs:
            print(run.text + " - " + str(run.font.highlight_color))
            # Check if the run has a highlight color set
            if run.font.highlight_color is not None:
                highlighted_text.append(run.text)
                print(f"Found highlighted text: '{run.text}' with highlight color: {run.font.highlight_color}")
    return highlighted_text
However, in my test document it only finds grey highlights.
These are the results from the print statements:
Text (normal) - None
Text in grey - GRAY_25 (16)
Found highlighted text: 'Text in grey ' with highlight color: GRAY_25 (16)
Text in yellow - None
Text in green - None
So I'm not sure where I'm going wrong. I don't think the text has been shaded, as shading applies across a whole line.
Addendum: It only works for grey for me, which I highlighted myself in MS Office. The other highlights, which are getting missed, were done by someone else. They might have been made with an old copy of Office, with docx-compatible software, or by some other method of highlighting the text that isn't "highlighting".
Any ideas?
Upvotes: 2
Views: 54
Reputation: 6967
Whilst the two other answers are correct, this is more for my future self to refer back to. I looked closely at the docx's XML and saw that the text in question was being affected by a fill colour:
w:color="auto" w:fill="FFFF00"
Adding this check inside the loop over runs makes the script look for coloured fills as well:
# Check for shading (fill colour) on the run
# shaded_text is assumed to be a list created before the loop
shd_elements = run._element.xpath('.//w:shd')
if shd_elements:
    shading = shd_elements[0]
    fill_color = shading.get('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}fill')
    if fill_color:
        shaded_text.append((run.text, fill_color))
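For anyone who wants to confirm which mechanism a document uses before touching the python-docx script, here is a rough sketch that reads word/document.xml directly with the standard library and labels each run as highlighted, shaded, or neither. The function names classify_runs and classify_docx are my own, and treating a fill of "auto" as "no shading" is an assumption rather than something taken from the document above. It only looks at run-level properties, so shading set on a paragraph or via a style would not be picked up.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def classify_runs(document_xml):
    """Return a list of (text, kind, value) tuples, one per run.

    kind is "highlight" (w:highlight), "shading" (w:shd fill) or None.
    """
    root = ET.fromstring(document_xml)
    results = []
    for run in root.iter(W + "r"):
        # A run's text can be split across several w:t elements
        text = "".join(t.text or "" for t in run.iter(W + "t"))
        kind, value = None, None
        rpr = run.find(W + "rPr")  # run properties, if any
        if rpr is not None:
            highlight = rpr.find(W + "highlight")
            shd = rpr.find(W + "shd")
            if highlight is not None and highlight.get(W + "val") not in (None, "none"):
                kind, value = "highlight", highlight.get(W + "val")
            # Assumption: a fill of "auto" means no visible background
            elif shd is not None and shd.get(W + "fill") not in (None, "auto"):
                kind, value = "shading", shd.get(W + "fill")
        results.append((text, kind, value))
    return results


def classify_docx(docx_path):
    """Open a .docx (a zip archive) and classify the runs in its main body."""
    with zipfile.ZipFile(docx_path) as archive:
        return classify_runs(archive.read("word/document.xml"))
```

Calling classify_docx("text.docx") and printing the tuples should show at a glance whether the coloured text carries a w:highlight or a w:shd element.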
Upvotes: 0
Reputation: 331
This script performs well for me:
from docx import Document

def extract_highlighted_text(docx_path):
    doc = Document(docx_path)
    highlighted_texts = []
    for para in doc.paragraphs:
        for run in para.runs:
            if run.font.highlight_color is not None:
                highlighted_texts.append(run.text)
    return highlighted_texts

docx_file = "text.docx"
highlighted_texts = extract_highlighted_text(docx_file)
print("Highlighted Texts:")
for text in highlighted_texts:
    print(text)
Upvotes: 2
Reputation: 971
Your code should work.
An alternative would be to look for specific colours, something like the below.
from docx.enum.text import WD_COLOR_INDEX
Then use the condition below inside the loop over runs:
if run.font.highlight_color in [
    WD_COLOR_INDEX.YELLOW,
    WD_COLOR_INDEX.GREEN,
    WD_COLOR_INDEX.PINK,
    WD_COLOR_INDEX.BLUE,
    WD_COLOR_INDEX.RED,
    WD_COLOR_INDEX.GRAY_25,
    WD_COLOR_INDEX.GRAY_50,
]:
    # Same action as in the question's loop
    highlighted_text.append(run.text)
Upvotes: 1