Jaqen H'ghar

Reputation: 1879

How do I duplicate a PDF with some text replacement and redaction

I am exploring a couple of third-party components for working with PDFs in C#: Aspose.pdf.net and iTextSharp. Here is what I need them for:

I have some PDFs that contain sensitive information in the form of text, such as a person's name, city, etc. Each PDF needs to be duplicated, and while the copy is being created, the sensitive text must be found and replaced with dummy text. The replacement is essential so that the original information cannot be traced by any fraudulent means. The replaced text also needs to be redacted.

The text search should support regular expressions, as there could be variations of the text that needs to be masked.

Could you please help me with how this can be done using iTextSharp?

Thanks in advance.

Upvotes: 0

Views: 371

Answers (1)

Samuel Huylebroeck

Reputation: 1719

iTextSharp is capable of complete redaction (both the visual appearance and the data stored in the PDF) using the PdfSweep module (http://itextpdf.com/itext7/pdfsweep). To have the redaction happen after a text search, you would have to:

  1. Extract the text from the document (can be done using iText).
  2. Search through the extracted text and obtain the positions of the text you want redacted. (needs an implementation from your side)
  3. Use these positions to define where PdfSweep has to redact. (a couple of lines of code)
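
Step 2 is the part you have to write yourself, so here is a minimal, library-independent sketch of one way to do it, in Python for brevity (the same logic ports to C# with System.Text.RegularExpressions). It assumes step 1 gave you text chunks in reading order, each with a page number and bounding box. The `Chunk` type and `find_redaction_rects` function are hypothetical names for illustration, not part of iText or PdfSweep.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One extracted text fragment and its bounding box (hypothetical shape)."""
    text: str
    page: int
    x: float
    y: float
    width: float
    height: float


def find_redaction_rects(chunks, pattern):
    """Return (page, x, y, width, height) for every chunk a regex match touches."""
    regex = re.compile(pattern)

    # Join the chunk texts and remember which character span each chunk covers.
    spans = []
    parts = []
    offset = 0
    for chunk in chunks:
        spans.append((offset, offset + len(chunk.text), chunk))
        parts.append(chunk.text)
        offset += len(chunk.text)
    full_text = "".join(parts)

    rects = []
    for match in regex.finditer(full_text):
        if match.start() == match.end():
            continue  # ignore empty matches
        for start, end, chunk in spans:
            # A chunk is affected when its span overlaps the match span.
            if start < match.end() and match.start() < end:
                rect = (chunk.page, chunk.x, chunk.y, chunk.width, chunk.height)
                if rect not in rects:
                    rects.append(rect)
    return rects


# Made-up sample data: a name split across two chunks, plus a city.
sample_chunks = [
    Chunk("Name: ", 1, 50, 700, 40, 12),
    Chunk("John ", 1, 90, 700, 30, 12),
    Chunk("Smith", 1, 120, 700, 35, 12),
    Chunk(" lives in Paris", 1, 155, 700, 90, 12),
]
sample_rects = find_redaction_rects(sample_chunks, r"John\s+Smith|Paris")
```

Note that this sketch marks the whole chunk whenever any part of it matches, so it can redact more than the matched text (the last sample chunk is covered entirely, not just "Paris"). Narrowing the rectangle to the matched characters would need per-character positions from the extraction step. The resulting rectangles are what you would hand over in step 3.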

By default, PdfSweep redacts visually by drawing coloured bars over the locations, and internally removes the text and any image at those locations. While it is technically possible to use iText to fill the redacted positions with some dummy text, implementing this has a number of pitfalls.

PdfSweep is a closed-source module for iText 7; you can contact our sales team for more information on licensing.

Upvotes: 1
