Reputation: 6967

Detect highlighted text in .docx

I'm trying to detect text that has a coloured background in a MS Word docx, to separate it from the "normal" text.

from docx import Document
...
# Load the document
doc = Document(docx_path)
highlighted_text = []
normal_text = []

# Iterate through all paragraphs
for para in doc.paragraphs:

  # Iterate through all runs in the paragraph
  for run in para.runs:

    print(run.text + " - " + str(run.font.highlight_color))

    # Check if the run has a highlight color set
    if run.font.highlight_color is not None:
      highlighted_text.append(run.text)
      print(f"Found highlighted text: '{run.text}' with highlight color: {run.font.highlight_color}")

return highlighted_text

However, in my test document it only finds grey highlights:

This is the output from the print statements:

Text (normal) - None
Text in grey - GRAY_25 (16)
Found highlighted text: 'Text in grey ' with highlight color: GRAY_25 (16)
Text in yellow - None
Text in green - None

So I'm not sure where I'm going wrong. I don't think the text has been shaded, as that would be across a whole line.

Addendum: It only works for grey for me, which I highlighted myself in MS Office. The other highlights, which are getting missed, were done by someone else. This might have been done with an old copy of Office, docx-compatible software, or some other method of highlighting the text that isn't "highlighting".

Any ideas?

Upvotes: 2

Answers (3)

Ghoul Fool

Reputation: 6967

Whilst the other two answers are correct, this is more for my future self to refer back to. I looked closely at the docx's .xml and saw that the text in question was being affected by a fill colour:

w:color="auto" w:fill="FFFF00"

Adding this check to the script looks for coloured fills:

  # Check for shading (fill color)
  # shaded_text is assumed to be a list created alongside highlighted_text
  shd_elements = run._element.xpath('.//w:shd')
  if shd_elements:
    shading = shd_elements[0]
    fill_color = shading.get('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}fill')
    if fill_color:
      shaded_text.append((run.text, fill_color))
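
For reference, here is a minimal sketch of the same idea that reads the XML directly with the standard library instead of python-docx. It assumes that a "real" highlight appears as a w:highlight element and a fill as a w:shd element inside the run's w:rPr, as in the XML snippet above. The function name classify_runs and the "highlight" / "shading" / "normal" labels are made up for illustration. It only looks at formatting applied directly to the run, so a fill inherited from a style or set on the whole paragraph would not be picked up.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in the {uri}tag form ElementTree expects
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def classify_runs(docx_file):
    """Return a (text, kind, colour) tuple for each run in the main document part.

    kind is "highlight", "shading" or "normal" (labels chosen for this sketch).
    docx_file may be a path or a binary file-like object.
    """
    # A .docx is a zip archive; the body text lives in word/document.xml
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    results = []
    for run in root.iter(W + "r"):
        # A run can hold several w:t elements; join them into one string
        text = "".join(t.text or "" for t in run.iter(W + "t"))

        # Direct run properties, if the run has any
        rpr = run.find(W + "rPr")
        highlight = rpr.find(W + "highlight") if rpr is not None else None
        shd = rpr.find(W + "shd") if rpr is not None else None

        if highlight is not None and highlight.get(W + "val") not in (None, "none"):
            # Highlighter-style colour, stored as a name such as "yellow"
            results.append((text, "highlight", highlight.get(W + "val")))
        elif shd is not None and shd.get(W + "fill") not in (None, "auto"):
            # Shading fill, stored as a hex colour such as "FFFF00"
            results.append((text, "shading", shd.get(W + "fill")))
        else:
            results.append((text, "normal", None))
    return results
```

Calling classify_runs("text.docx") and printing each tuple should show which runs carry a highlight and which carry a fill, which makes it easier to see why highlight_color comes back as None for some coloured text.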

Upvotes: 0

Subir Chowdhury

Reputation: 331

This script performs well for me:

from docx import Document

def extract_highlighted_text(docx_path):
    doc = Document(docx_path)
    highlighted_texts = []

    for para in doc.paragraphs:
        for run in para.runs:
            if run.font.highlight_color is not None:
                highlighted_texts.append(run.text)

    return highlighted_texts

docx_file = "text.docx"
highlighted_texts = extract_highlighted_text(docx_file)

print("Highlighted Texts:")
for text in highlighted_texts:
    print(text)

Upvotes: 2

Sushil Behera

Reputation: 971

Your code should work.

An alternative would be to check for specific colors, something like the following.

from docx.enum.text import WD_COLOR_INDEX

Then use this condition:

if run.font.highlight_color in [WD_COLOR_INDEX.YELLOW, WD_COLOR_INDEX.GREEN, WD_COLOR_INDEX.PINK, WD_COLOR_INDEX.BLUE, WD_COLOR_INDEX.RED, WD_COLOR_INDEX.GRAY_25, WD_COLOR_INDEX.GRAY_50]:

Upvotes: 1
