Can python-docx preserve font color and styles when importing documents?

Question

Essentially what I need to do is write a program that takes in many .docx files and puts them all in one, ordered in a certain way. I have importing working via:

import glob

import docx

finaldocname = 'Midterm-All-Questions.docx'
finaldoc = docx.Document()
docstoworkon = glob.glob('*.docx')
if finaldocname in docstoworkon:
    docstoworkon.remove(finaldocname)  # don't process the final doc if it already exists

for f in docstoworkon:
    doc = docx.Document(f)

    # Collect the raw text of every paragraph (all formatting is lost here)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)

    # finaldoc.styles = doc.styles
    for l in fullText:
        if '#' in l:
            print('We got here!')
            if '#1 ' not in l:  # every question except the first one
                finaldoc.add_section()  # new section (new page by default) between questions
        finaldoc.add_paragraph(l)
finaldoc.save(finaldocname)

But I need to preserve text styles, like font colors, sizes, italics, etc., and this method loses them since it just gets the raw text and dumps it. I can't find anything in the python-docx documentation about preserving text styles or importing anything other than raw text. Does anyone know how to go about this?

MyNameIsCaleb · Accepted Answer

Styles are a bit difficult to work with in python-docx but it can be done.

See this explanation first to understand some of the problems with styles and Word.

The Long Way

When you read in a file as a Document(), it brings in all of the paragraphs, and within each paragraph are the runs. A run is a contiguous chunk of text that shares the same character formatting.

You can find out how many paragraphs or runs there are by calling len() on doc.paragraphs or on a paragraph's runs, or you can iterate through them like you did in your example with paragraphs.

You can inspect the style of any given paragraph, but runs may have different styles than the paragraph as a whole, so I would skip to the run itself and inspect the style there using paragraphs[0].runs[0].style, which gives you a style object. Beyond that, you can inspect the run's font object (paragraphs[0].runs[0].font), which exposes a number of attributes like size, italic, bold, etc.
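As a rough illustration of that inspection step, here is a minimal sketch that collects the directly applied formatting of one run into a dict. The helper name describe_run is made up for this example; the attributes it reads (bold, italic, underline, style.name, font.name, font.size, font.color.rgb) are the ones python-docx exposes on a run.

```python
def describe_run(run):
    """Return the directly applied formatting of a python-docx run as a dict.

    `describe_run` is a hypothetical helper name, not part of python-docx.
    Any value that comes back as None is not set on the run itself and is
    inherited from the style hierarchy instead.
    """
    font = run.font
    return {
        "text": run.text,
        "style": run.style.name if run.style is not None else None,
        "bold": run.bold,            # True, False, or None (inherit)
        "italic": run.italic,        # True, False, or None (inherit)
        "underline": run.underline,
        "name": font.name,           # typeface name, or None
        # font.size is a length object with a .pt property, or None
        "size_pt": font.size.pt if font.size is not None else None,
        # font.color.rgb is an RGB value (prints as hex, e.g. FF0000), or None
        "color": str(font.color.rgb) if font.color.rgb is not None else None,
    }
```

Calling describe_run(doc.paragraphs[0].runs[0]) on a document you have opened shows quickly which attributes are actually set on the run and which are only inherited.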

Now to the long solution:
You should first create a new blank paragraph, then call add_run() for each run of text from your original, one by one. For each of these you can define a style attribute, but it would have to be a named style as described in the first link. You cannot apply a style object directly, as it won't copy the attributes over. But there is a way around that: check the attributes that you care about copying to the output, and then ensure your new run applies the same attributes.

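import docx

# `doc` is a source document opened as in the question: doc = docx.Document(f)
doc_out = docx.Document()
for para in doc.paragraphs:
    p = doc_out.add_paragraph()
    for run in para.runs:
        r = p.add_run(run.text)
        # bold and italic are tri-state: True, False, or None (inherit from the style),
        # so copy the value itself rather than only setting True
        r.bold = run.bold
        r.italic = run.italic
        # etc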

Obviously this is inefficient and not a great solution, but it will work to ensure you have copied the style appropriately.
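Since the question asks about font color and size specifically, the same idea can be extended to those attributes. The following is only a sketch: copy_run_format and append_paragraphs are helper names invented for this example, and it copies only formatting applied directly to a run, so anything inherited from a named style or given as a theme color will not come across.

```python
def copy_run_format(src, dst):
    """Copy directly applied character formatting from one run to another.

    Hypothetical helper, not part of python-docx. Both arguments are
    expected to be python-docx run objects.
    """
    # Tri-state properties: True, False, or None (inherit from the style)
    dst.bold = src.bold
    dst.italic = src.italic
    dst.underline = src.underline

    # Typeface and size; None means "not set on this run"
    dst.font.name = src.font.name
    dst.font.size = src.font.size

    # Only touch the color when the source run has an explicit RGB color
    rgb = src.font.color.rgb
    if rgb is not None:
        dst.font.color.rgb = rgb


def append_paragraphs(src_doc, dst_doc):
    """Append every paragraph of src_doc to dst_doc, run by run."""
    for para in src_doc.paragraphs:
        new_para = dst_doc.add_paragraph()
        for run in para.runs:
            new_run = new_para.add_run(run.text)
            copy_run_format(run, new_run)
```

In the question's loop, append_paragraphs(doc, finaldoc) would take the place of collecting fullText and calling finaldoc.add_paragraph(l), with the '#' check moved to para.text.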

Add New Styles

There is a way to add styles by name. However, the Word document you are getting the text and styles from probably isn't using named styles; more likely the author just applied bold, etc. directly to the words they wanted. So it is probably going to be a long road of adding a lot of slightly different styles, or sometimes even duplicates of the same one.
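For completeness, adding a named character style might look roughly like the sketch below. The helper name add_character_style and its parameters are made up for illustration; the python-docx calls it relies on are document.styles.add_style(), WD_STYLE_TYPE.CHARACTER, Pt, RGBColor, and the style argument of add_run().

```python
def add_character_style(document, name, bold=None, italic=None,
                        size_pt=None, rgb=None):
    """Create (or reuse) a named character style on a python-docx document.

    Hypothetical helper. `rgb` is an (r, g, b) tuple of ints from 0 to 255.
    """
    # Imported here so the helper can be defined even where python-docx
    # (pip install python-docx) is not yet installed.
    from docx.enum.style import WD_STYLE_TYPE
    from docx.shared import Pt, RGBColor

    # Reuse the style if the document already has one with this name
    if name in document.styles:
        return document.styles[name]

    style = document.styles.add_style(name, WD_STYLE_TYPE.CHARACTER)
    if bold is not None:
        style.font.bold = bold
    if italic is not None:
        style.font.italic = italic
    if size_pt is not None:
        style.font.size = Pt(size_pt)
    if rgb is not None:
        style.font.color.rgb = RGBColor(*rgb)
    return style
```

After add_character_style(doc_out, "RedEmphasis", bold=True, rgb=(255, 0, 0)), a run can pick the style up by name with p.add_run("some text", style="RedEmphasis"). You would still need one such style for every distinct combination of formatting in the source documents, which is why this route gets long.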

Unfortunately that is the best answer I have for you on how to do this. Working with Word, Outlook, and Excel documents is not great in Python, especially for what you are trying to do.
