Z.L

Reputation: 149

Extract hyperlink from pptx

I want to extract hyperlinks from a .pptx file. I know how to do it in Word, but does anyone know how to extract them from a .pptx?

For example, I have the text below in a .pptx and I want to get the URL https://stackoverflow.com/ :


Hello, stackoverflow


I tried to write the Python code to get the text:

from pptx import Presentation
from pptx.opc.constants import RELATIONSHIP_TYPE as RT

ppt = Presentation('data/ppt.pptx')

for i, sld in enumerate(ppt.slides, start=1):
    print(f'-- {i} --')
    for shp in sld.shapes:
        if shp.has_text_frame:
            print(shp.text)

But I only want to print the text and its URL when the text has a hyperlink.

Upvotes: 3

Views: 1543

Answers (3)

aehruesch

Reputation: 55

This worked for me, using the win32com library:

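import os
import win32com.client

# PowerPoint may not share Python's working directory, so pass an absolute path
filename = os.path.abspath('data/ppt.pptx')
PptApp = win32com.client.Dispatch("PowerPoint.Application")
PptApp.Visible = True
pptx = PptApp.Presentations.Open(filename, ReadOnly=False)
for slide in pptx.Slides:
    for shape in slide.Shapes:
        # Skip shapes without a text frame, or whose text frame cannot be queried
        try:
            if not shape.HasTextFrame:
                continue
        except Exception:
            continue
        text = shape.TextFrame.TextRange.Text
        # This only finds URLs that are written out in the visible text
        if "://" in text:
            print(text)
PptApp.Quit()
pptx = None
PptApp = None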

Upvotes: 0

scanny

Reputation: 29021

In python-pptx, a hyperlink can appear on a Run, which I believe is what you're after. This means zero or more hyperlinks can appear in a given shape. Note also that a hyperlink can be set on the shape as a whole, such that clicking on the shape follows the link. In that case, the text of the URL does not appear.

from pptx import Presentation

prs = Presentation('data/ppt.pptx')

for slide in prs.slides:
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue
        for paragraph in shape.text_frame.paragraphs:
            for run in paragraph.runs:
                address = run.hyperlink.address
                if address is None:
                    continue
                print(address)

The relevant sections of the documentation are here:
https://python-pptx.readthedocs.io/en/latest/api/text.html#run-objects

and here:
https://python-pptx.readthedocs.io/en/latest/api/action.html#hyperlink-objects
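
Since the question asks for the linked text together with its URL, the loop above could be adapted along these lines. This is only a sketch and not part of the original answer: `iter_links` is a hypothetical helper name, and it assumes `run.hyperlink.address` and `shape.click_action.hyperlink.address` behave as described in the documentation linked above. Shapes whose click action cannot be read (group shapes, for example) are simply skipped.

```python
def iter_links(prs):
    """Yield (slide_number, text, url) for each hyperlink in a presentation.

    `prs` is expected to be a python-pptx Presentation object.
    `iter_links` is a hypothetical helper, not a python-pptx function.
    """
    for number, slide in enumerate(prs.slides, start=1):
        for shape in slide.shapes:
            # Hyperlink attached to the shape as a whole (its click action).
            try:
                shape_url = shape.click_action.hyperlink.address
            except (AttributeError, TypeError):
                shape_url = None  # assumed: some shapes expose no click action
            if shape_url is not None:
                text = shape.text_frame.text if shape.has_text_frame else ""
                yield number, text, shape_url

            if not shape.has_text_frame:
                continue
            # Hyperlinks attached to individual runs of text.
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    url = run.hyperlink.address
                    if url is not None:
                        yield number, run.text, url


# Usage (requires python-pptx and a real file):
# from pptx import Presentation
# for number, text, url in iter_links(Presentation('data/ppt.pptx')):
#     print(f'-- {number} -- {text!r} -> {url}')
```

Yielding tuples rather than printing keeps the traversal separate from the output, so the same helper could feed a CSV writer or a list.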

Upvotes: 2

Steve Rindsberg

Reputation: 14809

I can't help with the Python part, but here's an example of how to extract the hyperlink URLs themselves, rather than the text that the links are applied to, which is what I think you're after.

Each slide in PPT has a Hyperlinks collection that contains all of the hyperlinks on the slide. Each hyperlink has an .Address and .SubAddress property. In the case of a URL like https://www.someplace.com#placeholder, the .Address would be https://www.someplace.com and the .SubAddress would be placeholder, for example.

Sub ExtractHyperlinks()

Dim oSl As Slide
Dim oHl As Hyperlink
Dim sOutput As String

' Look at each slide in the presentation
For Each oSl In ActivePresentation.Slides
    sOutput = sOutput & "Slide " & oSl.SlideIndex & vbCrLf
    ' Look at each hyperlink on the slide
    For Each oHl In oSl.Hyperlinks
        sOutput = sOutput & vbTab & oHl.Address & " | " & oHl.SubAddress & vbCrLf
    Next    ' Hyperlink
Next    ' Slide

Debug.Print sOutput

End Sub
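
For the Python side, the same Hyperlinks collection should be reachable through COM, much as the win32com answer drives PowerPoint. The following is a rough sketch rather than a tested port: `collect_hyperlinks` is a hypothetical helper name, and it assumes it is handed a PowerPoint Presentation COM object exposing the `Slides`, `Hyperlinks`, `SlideIndex`, `Address` and `SubAddress` members used in the VBA above.

```python
def collect_hyperlinks(presentation):
    """Return (slide_index, address, sub_address) for every hyperlink.

    `presentation` is expected to be a PowerPoint Presentation COM object,
    for example the value returned by Presentations.Open(...) via win32com.
    `collect_hyperlinks` is a hypothetical helper name.
    """
    rows = []
    # Look at each slide in the presentation
    for slide in presentation.Slides:
        # Look at each hyperlink on the slide
        for link in slide.Hyperlinks:
            rows.append((slide.SlideIndex, link.Address, link.SubAddress))
    return rows


# Usage (assumes Windows with PowerPoint and pywin32 installed):
# import os
# import win32com.client
# app = win32com.client.Dispatch("PowerPoint.Application")
# pres = app.Presentations.Open(os.path.abspath('data/ppt.pptx'), ReadOnly=True)
# for index, address, sub_address in collect_hyperlinks(pres):
#     print(f"Slide {index}\t{address} | {sub_address}")
# pres.Close()
# app.Quit()
```

Unlike searching the shape text for "://", this reads the link targets themselves, so it would also pick up links whose visible text is not a URL.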

Upvotes: 0
