Michael Berk

Reputation: 725

Preserve text format on read/write to shape text python pptx

I am looking to perform text replacements in a shape's text. I am using code similar to the snippet below:

# define keys and their replacement values
SRKeys, SRVals = ['x', 'y', 'z'], [1, 2, 3]

# read the shape's text as one plain string
text = shape.text

# substitute each key with its value
for key, val in zip(SRKeys, SRVals):
    text = text.replace(key, str(val))

# write the substituted text back to the shape
shape.text = text

However, if the initial shape.text contains formatted characters (bold, for example), the formatting is lost in the read/write round trip. Is there a solution for this?

The only thing I could think of is to iterate over the characters and check for formatting, then add these formats before writing to shape.text.

Upvotes: 3

Views: 1502

Answers (2)

Michael Berk

Reputation: 725

Here is an adapted version of the code I'm using (inspired by @scanny's answer). It replaces text for all shapes (with text frame) on a slide.

from pptx import Presentation

prs = Presentation('../../test.pptx')
slide = prs.slides[1]

# iterate through all shapes on slide
for shape in slide.shapes:
    if not shape.has_text_frame:
        continue
        
    # iterate through paragraphs in shape
    for p in shape.text_frame.paragraphs:
        # store formats and their runs by index (not dict because of duplicate runs)
        formats, newRuns = [], []

        # iterate through runs
        for r in p.runs:
            # get text
            text = r.text

            # replace text
            text = text.replace('s','xyz')

            # store run
            newRuns.append(text)

            # store format
            formats.append({'size':r.font.size,
                            'bold':r.font.bold,
                            'underline':r.font.underline,
                            'italic':r.font.italic})

        # clear paragraph
        p.clear()

        # iterate through new runs and formats and write to paragraph
        for text, fmt in zip(newRuns, formats):
            # add run with text
            run = p.add_run()
            run.text = text

            # format run
            run.font.bold = fmt['bold']
            run.font.italic = fmt['italic']
            run.font.size = fmt['size']
            run.font.underline = fmt['underline']

prs.save('../../test.pptx')

Upvotes: 2

scanny

Reputation: 28893

@usr2564301 is on the right track. Character formatting (a.k.a. "font") is specified at the run level. That is what a run is: a "run" (sequence) of characters that all share the same character formatting.

When you assign to shape.text, you replace all the runs that used to be there with a single new run having default formatting. If you want to preserve formatting, you need to preserve whatever runs are not directly involved in the text replacement.

This is not a trivial problem because there is no guarantee runs break on word boundaries. Try printing out the runs for a few paragraphs and I think you'll see what I mean.

In rough pseudocode, I think this is the approach you would need to take:

  • Search for the target text in the paragraph to determine the offset of its first character.
  • Traverse all the runs in the paragraph, keeping a running total of how many characters precede each run, maybe something like (run_idx, prefix_len, length): (0, 0, 8), (1, 8, 4), (2, 12, 9), etc.
  • Identify which runs are the starting, ending, and in-between runs involved in your search string.
  • Split the first run at the start of the search term, split the last run at the end of the search term, and delete all but the first of the "middle" runs.
  • Change the text of the remaining middle run to the replacement text and clone the formatting from the prior (original start) run. Maybe this last bit you do at split-start time.

This preserves any runs that do not involve the search string and preserves the formatting of the "matched" word in the "replaced" word.

This requires a few operations that are not directly supported by the current API. For those you'd need to use lower-level lxml calls to directly manipulate the XML, although you could get hold of all the existing elements you need from python-pptx objects without ever having to parse the XML yourself.
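
As a rough illustration of the offset bookkeeping, here is a simplified variant of the steps above, not the full split-and-clone approach. Instead of splitting runs, it puts the replacement text into the run where the match starts (so it takes that run's formatting) and trims the matched characters out of any later runs, leaving emptied runs in place. That way it only needs a read/write `.text` attribute on each run, which python-pptx runs have. The names `run_offsets` and `replace_in_runs` are hypothetical, and the stand-in objects are used only so the sketch runs without a .pptx file.

```python
from types import SimpleNamespace


def run_offsets(runs):
    """Return (run_idx, prefix_len, length) for each run."""
    offsets, total = [], 0
    for idx, run in enumerate(runs):
        offsets.append((idx, total, len(run.text)))
        total += len(run.text)
    return offsets


def replace_in_runs(runs, search, replacement):
    """Replace the first occurrence of `search` across `runs`, in place.

    Only run.text is modified, so each run keeps its own formatting.
    Returns True if a replacement was made, False otherwise.
    """
    if not search:
        return False
    full_text = "".join(run.text for run in runs)
    start = full_text.find(search)
    if start == -1:
        return False
    end = start + len(search)

    first = True
    for idx, prefix_len, length in run_offsets(runs):
        run_start, run_end = prefix_len, prefix_len + length
        if run_end <= start or run_start >= end:
            continue  # this run does not overlap the match
        text = runs[idx].text
        head = text[:max(start - run_start, 0)]  # characters before the match
        tail = text[end - run_start:]            # characters after the match
        # Only the first overlapping run receives the replacement text.
        runs[idx].text = head + (replacement if first else "") + tail
        first = False
    return True


# Stand-ins for runs; the match "world" straddles the first two runs.
runs = [SimpleNamespace(text=t) for t in ("Hello wo", "rld,", " goodbye!")]
print(run_offsets(runs))   # [(0, 0, 8), (1, 8, 4), (2, 12, 9)]
replace_in_runs(runs, "world", "there")
print([r.text for r in runs])   # ['Hello there', ',', ' goodbye!']
```

With python-pptx this would be called as `replace_in_runs(paragraph.runs, 'x', '1')` for each paragraph in `shape.text_frame.paragraphs`. It handles one occurrence per call; looping until it returns False would need care when the replacement itself contains the search string.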

Upvotes: 2
