Reputation: 1519
MS Azure lets you create a searchable PDF, as documented here.
The code is as follows:
import os
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeOutputOption, AnalyzeResult
endpoint = os.environ["DOCUMENTINTELLIGENCE_ENDPOINT"]
key = os.environ["DOCUMENTINTELLIGENCE_API_KEY"]
document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))
with open(path_to_sample_documents, "rb") as f:
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-read",
        body=f,
        output=[AnalyzeOutputOption.PDF],
    )
result: AnalyzeResult = poller.result()
operation_id = poller.details["operation_id"]
response = document_intelligence_client.get_analyze_result_pdf(model_id=result.model_id, result_id=operation_id)
with open("analyze_result.pdf", "wb") as writer:
    writer.writelines(response)
The "result" in the code above holds the recognized text (and its styling), and the "response" holds the completed PDF, which is written to disk in the last line.
My problem is that, after analyzing the text in "result", I need to replace or mask some of the text and save the PDF file while keeping the original structure. How can I do this?
Upvotes: 0
Views: 77
Reputation: 3448
Retrieve the searchable PDF from Azure Document Intelligence, locate the words you want to change, replace or mask them, and then save the modified PDF. The rest of the page layout stays as it was.
I installed the pymupdf package so that the PDF can be loaded with PyMuPDF (fitz).
Modified code:
import os
import fitz # PyMuPDF
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeOutputOption, AnalyzeResult
# Set up Azure Document Intelligence credentials
endpoint = "https://xxxxxxxxxxxxxxxxxxxx.cognitiveservices.azure.com/"
key = "8R4sxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxCOG8YNo"
# Initialize Document Intelligence Client
document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))
# Path to input PDF file
pdf_path = r"C:\Users\xxxxxxxx\Downloads\sample-1.pdf"
# Analyze the document
with open(pdf_path, "rb") as f:
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-read",
        body=f,
        output=[AnalyzeOutputOption.PDF],  # Request PDF output
    )
result: AnalyzeResult = poller.result()
operation_id = poller.details["operation_id"]
# Get the searchable PDF result from Azure
response = document_intelligence_client.get_analyze_result_pdf(model_id=result.model_id, result_id=operation_id)
# Save the generated PDF from Azure
searchable_pdf_path = "searchable_output.pdf"
with open(searchable_pdf_path, "wb") as writer:
    writer.writelines(response)
print(f"Searchable PDF saved as {searchable_pdf_path}")
# Load the searchable PDF with PyMuPDF
doc = fitz.open(searchable_pdf_path)
# Define words to replace/mask
replacement_map = {
    "Your Company": "Redacted Company",
    "123 Your Street": "Hidden Address",
    "Product Overview": "[Confidential]",
}
# Iterate through pages and replace text
for page in doc:
    for original_text, replacement_text in replacement_map.items():
        # search_for returns one rectangle per occurrence of the text
        for inst in page.search_for(original_text):
            # Mark the area for redaction: white fill, with the replacement
            # text written into the same rectangle in red. If the text does
            # not fit the rectangle, PyMuPDF may shrink or skip it.
            page.add_redact_annot(
                inst,
                text=replacement_text,
                fontsize=10,
                fill=(1, 1, 1),
                text_color=(1, 0, 0),
            )
    # Apply all redactions on this page once (removes the original text)
    page.apply_redactions()
modified_pdf_path = "modified_output.pdf"
doc.save(modified_pdf_path)
doc.close()
print(f"Modified PDF saved as {modified_pdf_path}")
When Azure Document Intelligence generates a searchable PDF, the recognized text is embedded as a selectable text layer over the page image, which is why it can be found with a text search. To replace or mask text in this PDF while preserving the original structure, you can use PyMuPDF (fitz).
Result:
searchable_output.pdf: The original searchable PDF.
modified_output.pdf: The final PDF with masked/replaced text.
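If you would rather drive the masking from the analyzed "result" (as the question asks) than from a text search in the PDF, the word polygons in the result can be turned into rectangles. The following is only a minimal sketch under some assumptions: the input was a PDF, so the page unit is inches with the origin at the top-left corner; each word has a `content` string and a flat `polygon` list of x, y pairs; and matching is exact and case-sensitive. The helper names `polygon_to_rect` and `find_phrase_rects` and the sample data are hypothetical and not part of any SDK.

```python
from typing import Dict, List, Sequence, Tuple

POINTS_PER_INCH = 72.0  # PDF coordinates are measured in points

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in points


def polygon_to_rect(polygon: Sequence[float], scale: float = POINTS_PER_INCH) -> Rect:
    """Bounding box of a flat [x1, y1, x2, y2, ...] polygon, scaled to points."""
    xs = polygon[0::2]
    ys = polygon[1::2]
    return (min(xs) * scale, min(ys) * scale, max(xs) * scale, max(ys) * scale)


def find_phrase_rects(words: Sequence[Dict], phrase: str) -> List[Rect]:
    """Return one rectangle per occurrence of `phrase` in a page's word list.

    `words` is a list of {"content": str, "polygon": [floats]} items in
    reading order (an assumed stand-in for the words of one analyzed page).
    """
    target = phrase.split()
    rects: List[Rect] = []
    if not target:
        return rects
    n = len(target)
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        if [w["content"] for w in window] == target:
            # Union of the bounding boxes of all words in the phrase
            boxes = [polygon_to_rect(w["polygon"]) for w in window]
            rects.append((
                min(b[0] for b in boxes),
                min(b[1] for b in boxes),
                max(b[2] for b in boxes),
                max(b[3] for b in boxes),
            ))
    return rects


# Hand-made sample data standing in for the words of one page (inches).
sample_words = [
    {"content": "Your", "polygon": [1.0, 1.0, 1.5, 1.0, 1.5, 1.25, 1.0, 1.25]},
    {"content": "Company", "polygon": [1.75, 1.0, 2.5, 1.0, 2.5, 1.25, 1.75, 1.25]},
    {"content": "Invoice", "polygon": [1.0, 2.0, 1.75, 2.0, 1.75, 2.25, 1.0, 2.25]},
]

phrase_rects = find_phrase_rects(sample_words, "Your Company")
print(phrase_rects)
```

Each tuple can then be wrapped as `fitz.Rect(*rect)` and passed to `page.add_redact_annot(...)` on the matching page. Note that a word with attached punctuation (for example "Company,") would not match here, so some normalization may be needed for real documents.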
Upvotes: -1