Droid-Bird

Reputation: 1519

How do I replace or mask text in a PDF analyzed/made searchable by Microsoft Azure Document Intelligence?

Azure Document Intelligence lets you create a searchable PDF, as documented here.

The code is as follows:

import os

from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeOutputOption, AnalyzeResult

endpoint = os.environ["DOCUMENTINTELLIGENCE_ENDPOINT"]
key = os.environ["DOCUMENTINTELLIGENCE_API_KEY"]

document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))

with open(path_to_sample_documents, "rb") as f:
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-read",
        body=f,
        output=[AnalyzeOutputOption.PDF],
    )
result: AnalyzeResult = poller.result()
operation_id = poller.details["operation_id"]

response = document_intelligence_client.get_analyze_result_pdf(model_id=result.model_id, result_id=operation_id)
with open("analyze_result.pdf", "wb") as writer:
    writer.writelines(response)

In the code above, "result" holds the extracted text (and its styling), and "response" holds the finished PDF, which the last line writes to disk.

My problem: after analyzing the text in "result", I need to replace or mask some of it and save the PDF with its original structure intact. How can I do this?

Upvotes: 0

Views: 77

Answers (1)

Suresh Chikkam

Reputation: 3448

Retrieve the searchable PDF from Azure Document Intelligence, locate the words you want to change, replace or mask them, and then save the modified PDF while keeping the original layout.

I installed the pymupdf package so the PDF can be loaded with PyMuPDF (fitz).

Modified code:

import os
import fitz  # PyMuPDF
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeOutputOption, AnalyzeResult

# Set up Azure Document Intelligence credentials (read from environment variables, as in the question)
endpoint = os.environ["DOCUMENTINTELLIGENCE_ENDPOINT"]
key = os.environ["DOCUMENTINTELLIGENCE_API_KEY"]

# Initialize Document Intelligence Client
document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))

# Path to input PDF file
pdf_path = r"C:\Users\xxxxxxxx\Downloads\sample-1.pdf"

# Analyze the document
with open(pdf_path, "rb") as f:
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-read",
        body=f,
        output=[AnalyzeOutputOption.PDF],  # Request PDF output
    )

result: AnalyzeResult = poller.result()
operation_id = poller.details["operation_id"]

# Get the searchable PDF result from Azure
response = document_intelligence_client.get_analyze_result_pdf(model_id=result.model_id, result_id=operation_id)

# Save the generated PDF from Azure
searchable_pdf_path = "searchable_output.pdf"
with open(searchable_pdf_path, "wb") as writer:
    writer.writelines(response)

print(f"Searchable PDF saved as {searchable_pdf_path}")

# Load the searchable PDF with PyMuPDF
doc = fitz.open(searchable_pdf_path)

# Define words to replace/mask
replacement_map = {
    "Your Company": "Redacted Company",
    "123 Your Street": "Hidden Address",
    "Product Overview": "[Confidential]",
}

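# Iterate through pages and replace text
for page in doc:
    replacements = []  # (rectangle, replacement text) pairs found on this page
    for original_text, replacement_text in replacement_map.items():
        for inst in page.search_for(original_text):  # Find text occurrences
            # Mark the original text for redaction with a white fill
            page.add_redact_annot(inst, fill=(1, 1, 1))
            replacements.append((inst, replacement_text))

    # Apply all redactions once per page (removes the original text),
    # so a later redaction cannot wipe out text inserted earlier
    page.apply_redactions()

    for inst, replacement_text in replacements:
        # insert_text expects the baseline start point, so anchor near the
        # bottom-left corner of the rectangle (the 2 pt offset is a rough guess)
        page.insert_text((inst.x0, inst.y1 - 2), replacement_text, fontsize=10, color=(1, 0, 0))  # Red color text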

modified_pdf_path = "modified_output.pdf"
doc.save(modified_pdf_path)
doc.close()

print(f"Modified PDF saved as {modified_pdf_path}")

The searchable PDF that Azure Document Intelligence generates contains selectable text, so PyMuPDF (fitz) can find it with search_for, redact it, and write replacement text at the same position while the rest of the page layout stays unchanged.
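If the positions should come from the "result" object rather than from a text search, the word polygons have to be converted into PDF rectangles first. The following is only a minimal sketch under stated assumptions: it assumes the analyzed input was a PDF, so that the polygons are reported in inches from the top-left corner of the page, and that the searchable PDF keeps the same page size. The helper name polygon_to_rect and the sample polygon are hypothetical, not part of either library.

```python
from typing import Sequence, Tuple

POINTS_PER_INCH = 72.0  # PDF coordinates are measured in 1/72 inch units


def polygon_to_rect(
    polygon: Sequence[float], scale: float = POINTS_PER_INCH
) -> Tuple[float, float, float, float]:
    """Turn a flat [x1, y1, x2, y2, ...] polygon into (x0, y0, x1, y1) in points.

    Assumption: the polygon is given in inches with a top-left origin.
    For image input the unit is pixels, so a different scale is needed.
    """
    if len(polygon) < 6 or len(polygon) % 2:
        raise ValueError("polygon must be a flat list of x, y pairs")
    xs = polygon[0::2]
    ys = polygon[1::2]
    # The axis-aligned bounding box of the polygon, scaled to points
    return (min(xs) * scale, min(ys) * scale, max(xs) * scale, max(ys) * scale)


# Hypothetical word box: 1 inch from the left and top edges,
# 1 inch wide and 0.5 inch tall.
sample_polygon = [1.0, 1.0, 2.0, 1.0, 2.0, 1.5, 1.0, 1.5]
rect = polygon_to_rect(sample_polygon)
print(rect)  # (72.0, 72.0, 144.0, 108.0)
```

The resulting tuple could then be wrapped as fitz.Rect(*rect) and passed to page.add_redact_annot in place of a search_for hit. It is worth checking one known word visually before relying on the conversion.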

Result:

  • searchable_output.pdf: The original searchable PDF.

  • modified_output.pdf: The final PDF with masked/replaced text.
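Since the question asks about working from the text in "result", one possible way to locate a multi-word phrase there is to match it against the consecutive word contents of a page. This is a rough sketch, not the SDK's own method: find_phrase_spans is a hypothetical helper, the word list stands in for the content values of a page's words, and the match is exact and case-sensitive.

```python
from typing import List, Sequence, Tuple


def find_phrase_spans(words: Sequence[str], phrase: str) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, where the
    whitespace-separated tokens of phrase match consecutive words."""
    tokens = phrase.split()
    if not tokens:
        return []
    n = len(tokens)
    spans = []
    for start in range(len(words) - n + 1):
        # Compare a window of n words against the phrase tokens
        if list(words[start:start + n]) == tokens:
            spans.append((start, start + n))
    return spans


# Stand-in for the word contents of one analyzed page
page_words = ["Your", "Company", "123", "Your", "Street"]
print(find_phrase_spans(page_words, "Your Company"))     # [(0, 2)]
print(find_phrase_spans(page_words, "123 Your Street"))  # [(2, 5)]
```

Each span identifies the words whose bounding polygons would need to be masked together; punctuation attached to a word, or a different casing, would prevent a match in this simple form.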

Upvotes: -1
