Reputation: 63
I'm very much a beginner with Python. I've been reading about (and trying) docx, textract and docx2txt, all of which should do what I'm after, but I can't get any of them to work. docx2txt looks like it should be the most suitable.
I just want a list of all hyperlink targets within a docx file. There are many in the document(s) I want to look at, but I can't get anything to produce actual output; the script below just prints 'No hyperlinks found.' I'm starting small and trying to get it working on a single file before moving on to scraping multiple files.
#import textract
#import docx
import os
import docx2txt
import re

def extract_hyperlinks_from_word(file_path):
    try:
        hyperlinks = []
        # Extract text content from the Word document using docx2txt
        text = docx2txt.process(file_path)
        # Use regular expressions to find hyperlinks in the extracted text
        hyperlink_pattern = r'(https?://\S+|www\.\S+)'
        matches = re.finditer(hyperlink_pattern, text)
        # Iterate matches and append them to the list
        for match in matches:
            hyperlinks.append(match.group())
        return hyperlinks
    except Exception as e:
        # Handle exceptions, print error message
        print(f"Error extracting hyperlinks from Word document {file_path}: {e}")
        return None

# Check if main or module
if __name__ == "__main__":
    # Path to document
    word_document_path = 'filepath.docx'
    # Call hyperlink extraction function
    extracted_hyperlinks = extract_hyperlinks_from_word(word_document_path)
    # Check if hyperlinks were extracted and print the results
    if extracted_hyperlinks:
        print("Extracted Hyperlinks:")
        for hyperlink in extracted_hyperlinks:
            print(hyperlink)
    else:
        print("No hyperlinks found.")
I've found similar questions on here where people wanted to alter hyperlinks, but I just want to read them.
The hyperlinks are in the Word document as ordinary text that has been converted to a link via right click > Link. The addresses are all standard web addresses, so I thought regular expressions would help, but I can't get it working.
Upvotes: 0
Views: 891
Reputation: 63
Fixed it. It turns out (after much digging) that what I needed to look at was not the hyperlinks themselves but the relationships within the document. I haven't tested this on a document that also contains hyperlinks not stored as relationships, though those work with the other methods I tried earlier. The code below solved my issue:
from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

def get_hyperlinks_from_docx(docx_filename):
    # Open the document
    doc = Document(docx_filename)
    # Iterate through the relationships of the main document part
    for rel_id, rel in doc.part.rels.items():
        # Check if the relationship type is a hyperlink
        if rel.reltype == RT.HYPERLINK:
            # print(f"Relationship ID: {rel_id}")  # not needed, only the target is of interest
            # target_ref is the public accessor; for external links it returns the same value as rel._target
            print(f"Target URL: {rel.target_ref}")

# Document path
document_path = "testDocument.docx"
# Call the function with the document name
get_hyperlinks_from_docx(document_path)
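For anyone wondering what these relationships actually are: a .docx file is a zip archive, and the link targets for the main body sit in a small XML part, separate from the visible text. That is most likely why a regex over the docx2txt output finds nothing when the link text is not itself a URL. Below is a rough standard-library sketch of the same lookup. The function name hyperlink_targets_from_rels is made up for illustration, and the part path and namespace URIs assume a regular (transitional) .docx; links in headers, footers or footnotes live in other .rels parts and are not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumed locations/URIs for a regular (transitional) .docx package.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)
MAIN_RELS_PATH = "word/_rels/document.xml.rels"


def hyperlink_targets_from_rels(docx_path):
    """Return hyperlink targets listed in the main document's relationships part."""
    with zipfile.ZipFile(docx_path) as archive:
        if MAIN_RELS_PATH not in archive.namelist():
            return []  # no relationships part, so no relationship-based links
        root = ET.fromstring(archive.read(MAIN_RELS_PATH))

    targets = []
    for rel in root.iter(f"{{{RELS_NS}}}Relationship"):
        # Keep only relationships whose type marks them as hyperlinks.
        if rel.get("Type") == HYPERLINK_TYPE:
            targets.append(rel.get("Target"))
    return targets
```

Calling hyperlink_targets_from_rels("testDocument.docx") should give a list of the same targets that the python-docx loop above prints, without needing any third-party package.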
Upvotes: 0
Reputation: 11
You can use the python-docx library (imported as docx) to get the URLs from your file:
pip install python-docx
from docx import Document
doc = Document('tstdoc.docx')
# Extract hyperlinks from the document part
for rel in doc.part.rels.values():
    if "http" in rel.target_ref:
        print("url:", rel.target_ref)
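Both snippets above list targets only. If you also want to know which visible text each link is attached to, one option is to read the document XML directly and match each w:hyperlink element to its relationship by ID. This is only a sketch under assumptions: the helper name hyperlinks_with_text is hypothetical, it uses the standard library rather than python-docx, it looks at the main body only, and it will not see links that Word stored as HYPERLINK field codes instead of w:hyperlink elements.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs assumed for a regular (transitional) .docx package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def hyperlinks_with_text(docx_path):
    """Return (visible text, target) pairs for links in the main document body."""
    with zipfile.ZipFile(docx_path) as archive:
        document = ET.fromstring(archive.read("word/document.xml"))
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))

    # Map relationship IDs (for example "rId5") to their targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{PKG_RELS_NS}}}Relationship")
    }

    pairs = []
    for link in document.iter(f"{{{W_NS}}}hyperlink"):
        rel_id = link.get(f"{{{R_NS}}}id")
        if rel_id not in targets:
            continue  # e.g. internal bookmark links that use w:anchor, not r:id
        # A link's text can be split over several runs, so join every w:t inside it.
        text = "".join(t.text or "" for t in link.iter(f"{{{W_NS}}}t"))
        pairs.append((text, targets[rel_id]))
    return pairs
```

Something like `for text, url in hyperlinks_with_text("tstdoc.docx"): print(text, "->", url)` would then print each link's text next to its address.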
Upvotes: 1