Dan Lewis

Reputation: 63

Using Python to scrape URLs from hyperlinks within Word

I'm very much a beginner with Python, but I have been reading about (and trying) docx, textract and docx2txt. All of them should do what I'm after, but I can't get any of them to work. docx2txt looks like it should be the most suitable...

I just want a list of all hyperlink targets within a docx file. There are many in the document(s) I want to look at, but I can't get anything to produce actual output; instead my script just prints 'No hyperlinks found.' I'm starting small and just trying to get it working on a single file before moving on to scraping multiple files.

#import textract
#import docx
import os
import docx2txt
import re

def extract_hyperlinks_from_word(file_path):
    try:
        hyperlinks = []

        # Extract text content from the Word document using docx2txt
        text = docx2txt.process(file_path)

        # Use regular expressions to find hyperlinks in the extracted text
        hyperlink_pattern = r'(https?://\S+|www\.\S+)'
        matches = re.finditer(hyperlink_pattern, text)

        # Iterate matches and append them to the list
        for match in matches:
            hyperlinks.append(match.group())

        return hyperlinks
    except Exception as e:
        # Handle exceptions, print error message
        print(f"Error extracting hyperlinks from Word document {file_path}: {e}")
        return None

# Check if main or module
if __name__ == "__main__":
    # Path to document
    word_document_path = 'filepath.docx'

    # Call hyperlink extraction function
    extracted_hyperlinks = extract_hyperlinks_from_word(word_document_path)

    # Check if hyperlinks were extracted and print the results
    if extracted_hyperlinks:
        print("Extracted Hyperlinks:")
        for hyperlink in extracted_hyperlinks:
            print(hyperlink)
    else:
        print("No hyperlinks found.")

I've found similar problems on here where they wanted to alter hyperlinks - but I just want to read them.

They [the hyperlinks] are in the Word document as text that has been converted to a hyperlink via right click > Link. The addresses are all standard web addresses, so I thought regular expressions would help... but I just can't seem to get it working.

Upvotes: 0

Views: 891

Answers (2)

Dan Lewis

Reputation: 63

Fixed it. It turns out (after much digging) that what I needed to look at was not the hyperlinks themselves but the 'relationships' within the document. I haven't tested this on a document that also contains non-relationship hyperlinks, though those work with the other methods I tried earlier. The code below solved my issue:

from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

def get_hyperlinks_from_docx(docx_filename):
    # Open the document
    doc = Document(docx_filename)
    # Iterate through the relationships of the main document part
    for rel_id, rel in doc.part.rels.items():
        # Check if the relationship type is a hyperlink
        if rel.reltype == RT.HYPERLINK:
            # print(f"Relationship ID: {rel_id}")  # only the target is of interest
            # target_ref is the public accessor for the URL (instead of the private _target)
            print(f"Target URL: {rel.target_ref}")

# Document path
document_path = "testDocument.docx"

# Call the function with the document name
get_hyperlinks_from_docx(document_path)
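
For anyone wondering why the regex attempt in the question found nothing: a .docx file is a zip archive, and for links created via right click > Link the URL is stored in the relationships part (word/_rels/document.xml.rels), not in the visible text that docx2txt returns. Below is a rough standard-library sketch that reads that part directly. The function name list_hyperlink_targets is my own, and it only looks at the main document body (headers, footers and footnotes keep their own .rels files, which this does not read).

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace and type URI used by the package relationships format
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def list_hyperlink_targets(docx_file):
    """Return external hyperlink targets from the main document part.

    docx_file may be a path or a binary file-like object.
    (Hypothetical helper, not part of any library.)
    """
    with zipfile.ZipFile(docx_file) as archive:
        try:
            rels_xml = archive.read("word/_rels/document.xml.rels")
        except KeyError:
            # No relationships part, so no relationship-based hyperlinks
            return []

    root = ET.fromstring(rels_xml)
    targets = []
    for rel in root.iter(f"{RELS_NS}Relationship"):
        is_hyperlink = rel.get("Type") == HYPERLINK_TYPE
        is_external = rel.get("TargetMode") == "External"
        if is_hyperlink and is_external:
            targets.append(rel.get("Target"))
    return targets
```

Calling list_hyperlink_targets("testDocument.docx") should give the same URLs as the python-docx version above, returned as a list instead of printed.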

Upvotes: 0

Bear

Reputation: 11

You can use the python-docx library (imported as docx) to get the URLs from your file:

pip install python-docx

from docx import Document

doc = Document('tstdoc.docx')

# Extract hyperlinks from the document part
for rel in doc.part.rels.values():
    if "http" in rel.target_ref:
        print("url:", rel.target_ref)
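
Since the question mentions moving on to multiple files, here is a possible extension, offered as a sketch rather than a tested solution. It filters on the relationship type instead of the "http" substring (so mailto: links are kept and non-hyperlink targets that happen to contain "http" are skipped) and loops over every .docx in a folder. The names external_hyperlinks and hyperlinks_in_folder, and the one-flat-folder layout, are my assumptions.

```python
from pathlib import Path

# Same value as docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def external_hyperlinks(rels):
    """Return target URLs of external hyperlink relationships.

    rels is any iterable of objects exposing reltype, is_external and
    target_ref, e.g. doc.part.rels.values() from python-docx.
    """
    return [
        rel.target_ref
        for rel in rels
        if rel.reltype == HYPERLINK_TYPE and rel.is_external
    ]


def hyperlinks_in_folder(folder):
    """Map each .docx file name in folder to its list of hyperlink targets."""
    # Imported here so the filter above stays usable without python-docx
    from docx import Document

    results = {}
    for path in sorted(Path(folder).glob("*.docx")):
        if path.name.startswith("~$"):
            # Skip the lock files Word creates while a document is open
            continue
        doc = Document(str(path))
        results[path.name] = external_hyperlinks(doc.part.rels.values())
    return results
```

Something like hyperlinks_in_folder("my_docs") would then return a dict keyed by file name, which is easy to print or write out to CSV.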

Upvotes: 1
