rpilkey

Reputation: 995

How to program a text search and replace in PDF files

How can I programmatically search and replace text in a large number of PDF files? I would like to remove a URL that has been added to a set of files. I have been able to remove the link using JavaScript under Batch Processing in Adobe Pro, but the link text remains. I have seen recommendations to use text touchup, which works manually, but I don't want to modify 1300 files by hand.

Upvotes: 32

Views: 56889

Answers (13)

Hermann

Reputation: 637

16 years later, I still could not find an existing solution, so I started my own. Not JavaScript, but Python. It is in a "works for me" state, but it needs more sample documents to catch all the corner cases. Contributions (especially code) are welcome: https://github.com/hoehermann/pypdf_strreplace

To answer the question at least in some parts, these are the steps that need to happen:

  1. Parse the PDF file (I rely on PyPDF).
  2. Interpret the Content Streams (see 7.8 Content Streams and Resources in the reference).
  3. Decode the operands in the Text-Showing Operators (see 9.4.3 Text-Showing Operators) to an editable string (I chose Python's internal UCS strings).
  4. Replace the text.
  5. Encode the text to whatever byte-stream is needed for the particular operand.
  6. Write the changed Content Streams into the output PDF file (again, I rely on PyPDF for this).
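
Steps 3 to 5 can be sketched for the simplest case: a literal string shown with the Tj operator in an already decompressed content stream. This is not code from pypdf_strreplace. The function names are made up for illustration, and Latin-1 stands in for the font's real encoding, which many PDFs will not match (subset fonts and two-byte encodings need the font's own mapping).

```python
import re

# Escape sequences inside a PDF literal string. A backslash followed by a
# line break is a line continuation and produces nothing.
_UNESCAPE = {b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f",
             b"(": b"(", b")": b")", b"\\": b"\\", b"\n": b"", b"\r": b""}

# A literal string followed by the Tj operator, e.g. (Hello) Tj.
# Simplification: nested, unescaped parentheses are not handled.
_TJ = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj", re.DOTALL)


def decode_literal(raw: bytes) -> str:
    """Step 3: turn the bytes between ( and ) into a Python string."""
    def unescape(m):
        token = m.group(1)
        if re.fullmatch(rb"[0-7]{1,3}", token):   # octal escape such as \101
            return bytes([int(token, 8) & 0xFF])
        return _UNESCAPE.get(token, token)        # unknown escape: drop the backslash
    data = re.sub(rb"\\([0-7]{1,3}|.)", unescape, raw, flags=re.DOTALL)
    return data.decode("latin-1")                 # assumption: single-byte encoding


def encode_literal(text: str) -> bytes:
    """Step 5: re-encode and escape backslashes and parentheses."""
    return re.sub(rb"([\\()])", rb"\\\1", text.encode("latin-1"))


def replace_in_stream(stream: bytes, old: str, new: str) -> bytes:
    """Steps 3 to 5 applied to every (...) Tj operand in a content stream."""
    def fix(m):
        text = decode_literal(m.group(1)).replace(old, new)   # step 4
        return b"(" + encode_literal(text) + b") Tj"
    return _TJ.sub(fix, stream)
```

Parsing the file and writing the modified stream back (steps 1, 2 and 6) are left to a PDF library, as in the list above.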

Upvotes: 0

Richard

Reputation: 1

If the PDF is something you generated, try a different format. For example, libreoffice-draw can also output to SVG. These SVG files are very easy to edit (sed will do it). Then convert back with rsvg-convert. HTH.

Upvotes: -1

stirhale

Reputation: 27

I'm not sure I would want to do all the work of writing code to modify your 1300 files when there is a program that can do it for you. The other day, I used the Professional version of Infix to batch-modify almost 100 files using its "Find and Replace in Files" feature. It works great. I have evaluated other programs in hopes of finding find-and-replace functionality similar to Microsoft Word's, and Infix was the only one I found that can do it. Check out: http://www.iceni.com/infix-pro.htm (latest product: https://www.iceni.com/infixServer.htm)

Upvotes: 1

rogerdpack

Reputation: 66771

It appears that even in uncompressed PDFs, text is sometimes stored in an odd format internally. This makes "normal" command-line text replacement, a la sed, fail or at least be non-trivial.

I couldn't find anything that seemed to work with these glyph spacing offsets, i.e. text formatted as in the example below, which seems very common in PDFs. Here the phrase "Other information" is stored like this:

 [(O)-16(ther i)-20(nformati)-11(on )]TJ

I wrote a command line tool that is able to replace text embedded within these glyph offsets. It works OK for common use cases. Check it out here. It runs on Linux and Windows.
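
To illustrate the idea of matching across glyph offsets, here is a minimal Python sketch. It is not the Crystal tool's code. It joins the string fragments of each TJ array and, when the search text occurs in the joined text, writes the result back as a single string, so the kerning offsets of that array are lost. It assumes an uncompressed stream, a single-byte encoding, and no parentheses or backslashes in the fragments or in the replacement; the function name is made up.

```python
import re

# A TJ array: [ ... ] TJ, holding (string) fragments and numeric offsets.
TJ_ARRAY = re.compile(rb"\[((?:\([^()\\]*\)|[-\d.\s])*)\]\s*TJ")
FRAGMENT = re.compile(rb"\(([^()\\]*)\)")


def replace_in_tj_arrays(stream: bytes, old: bytes, new: bytes) -> bytes:
    def fix(match):
        # Glue the fragments together, ignoring the numeric offsets.
        joined = b"".join(FRAGMENT.findall(match.group(1)))
        if old not in joined:
            return match.group(0)   # leave unaffected arrays byte-for-byte intact
        # Emit one plain string; the original glyph spacing is discarded.
        return b"[(" + joined.replace(old, new) + b")]TJ"
    return TJ_ARRAY.sub(fix, stream)


line = b"[(O)-16(ther i)-20(nformati)-11(on )]TJ"
print(replace_in_tj_arrays(line, b"Other information", b"Further details"))
```

Dropping the offsets keeps the sketch short; a tool that wants to preserve spacing around the untouched characters has to map match positions back onto the individual fragments.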

First uncompress your PDF, then cd into the checked-out git repository and run:

Syntax

 $ crystal replaceinpdf.cr input_filename.pdf "something you want replaced" "what you want it replaced with" output_filename.pdf

Enjoy! See the git repo for more details. Requests welcome.

Upvotes: 0

Larry

Reputation: 507

I had also become desperate: after installing 10 PDF editors, all of which cost money, I had no success.

It turns out that pdftk plus a text editor suffice:

Replace Text in PDF Files

  • Use pdftk to uncompress PDF page streams

    pdftk original.pdf output original.uncompressed.pdf uncompress
    
  • Replace the text (sometimes this works, sometimes it doesn't) within original.uncompressed.pdf

  • Repair the modified (and now broken) PDF

    pdftk original.uncompressed.pdf output original.uncompressed.fixed.pdf
    

(from Joel Dare)
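
For a single file, the three steps above could be scripted roughly as follows. This is only a sketch: the two pdftk command lines are the ones quoted above, while the function names and the temporary file name are made up. The replacement is a plain byte substitution, so it fails in exactly the cases where editing by hand fails (text split by glyph offsets, non-trivial encodings).

```python
import subprocess
from pathlib import Path


def pdftk_commands(src: Path, work: Path, dst: Path) -> tuple[list[str], list[str]]:
    """Build the uncompress and repair invocations for one file."""
    uncompress = ["pdftk", str(src), "output", str(work), "uncompress"]
    repair = ["pdftk", str(work), "output", str(dst)]
    return uncompress, repair


def replace_bytes(data: bytes, old: bytes, new: bytes) -> tuple[bytes, int]:
    """Plain byte replacement; returns the new data and the number of hits."""
    return data.replace(old, new), data.count(old)


def process(src: Path, dst: Path, old: bytes, new: bytes) -> int:
    """Uncompress, replace, repair. Requires pdftk on the PATH."""
    work = dst.with_suffix(".uncompressed.pdf")   # hypothetical scratch file
    uncompress, repair = pdftk_commands(src, work, dst)
    subprocess.run(uncompress, check=True)
    data, hits = replace_bytes(work.read_bytes(), old, new)
    work.write_bytes(data)
    subprocess.run(repair, check=True)
    work.unlink()
    return hits
```

A return value of 0 from `process` means the search bytes never appeared verbatim in the uncompressed file, which is the "sometimes it doesn't" case.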

Upvotes: 11

Ahmet Firat Keler

Reputation: 4031

This library has extensive support. Check it out.

PDF-LIB

Upvotes: -1

Chris Dolan

Reputation: 8963

Finding text in a PDF can be inherently hard because of the graphical nature of the document format -- the letters you are searching for may not be contiguous in the file. That said, CAM::PDF has some search-replace capabilities and heuristics. Give changepagestring.pl a try and see if it works on your PDFs.

To install:

 $ cpan install CAM::PDF
 # start a new terminal if this is your first cpan module
 $ changepagestring.pl input.pdf oldtext newtext output.pdf
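
To run this over a whole directory of files, a small driver script is enough. Below is a sketch in Python; the directory layout and function names are assumptions, and only the argument order of changepagestring.pl is taken from the command above. Whether the script accepts an empty replacement string (to delete text outright) is something to verify on a test file first.

```python
import subprocess
from pathlib import Path


def build_jobs(src_dir: Path, out_dir: Path, old: str, new: str) -> list[list[str]]:
    """One changepagestring.pl command per PDF found in src_dir."""
    return [
        ["changepagestring.pl", str(pdf), old, new, str(out_dir / pdf.name)]
        for pdf in sorted(src_dir.glob("*.pdf"))
    ]


def run_jobs(jobs: list[list[str]]) -> list[str]:
    """Run every job; return the input paths of the ones that failed."""
    failed = []
    for job in jobs:
        if subprocess.run(job).returncode != 0:
            failed.append(job[1])
    return failed
```

Writing to a separate output directory leaves the originals untouched, which makes it easy to spot-check a few results before trusting the whole batch.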

Upvotes: 22

Tilal Ahmad

Reputation: 939

Although this is quite an old thread, I wanted to share a Node.js package option for searching and replacing text in PDFs: Aspose.PDF Cloud SDK for Node.js. It is a paid product, but it provides 150 free monthly API calls.


const { PdfApi } = require("asposepdfcloud");
const { TextReplaceListRequest } = require("asposepdfcloud/src/models/textReplaceListRequest");
const { TextReplace } = require("asposepdfcloud/src/models/textReplace");

// Get Client ID and Client Secret from https://dashboard.aspose.cloud/
const pdfApi = new PdfApi("xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxxxx");
const fs = require("fs"); // only needed by the commented-out upload/download steps

const name = "02_pages.pdf";
const remoteTempFolder = "Temp";
//const localTestDataFolder = "C:\\Temp";
//const path = remoteTempFolder + "\\" + name;
//const outputFile= "Replace_output.pdf";


// Upload File
//pdfApi.uploadFile(path, fs.readFileSync(localTestDataFolder + "\\" + name)).then((result) => {  
//                     console.log("Uploaded File");    
//                    }).catch(function(err) {
    // Deal with an error
//    console.log(err);
//});
    
const textReplace = new TextReplace();
textReplace.oldValue = "origami";
textReplace.newValue = "aspose";
textReplace.regex = false;

const textReplace1 = new TextReplace();
textReplace1.oldValue = "candy";
textReplace1.newValue = "biscuit";
textReplace1.regex = false;

const trr = new TextReplaceListRequest();
trr.textReplaces = [textReplace, textReplace1];


// Replace text
pdfApi.postDocumentTextReplace(name, trr, null, remoteTempFolder).then((result) => {    
    console.log(result.body.code);                  
}).catch(function(err) {
    // Deal with an error
    console.log(err);
});

//Download file
//const outputPath = "C:/Temp/" + outputFile;

//pdfApi.downloadFile(path).then((result) => {    
//  fs.writeFileSync(outputPath, result.body);
//    console.log("File Downloaded");    
//}).catch(function(err) {
    // Deal with an error
//    console.log(err);
//});

Upvotes: -1

smith

Reputation: 1

I suggest using the VeryPDF PDF Text Replacer Command Line software to batch replace text in PDF pages. You can run pdftr.exe to replace text easily, for example:

pdftr.exe -contentreplace "My Name=>Your Name" D:\in.pdf D:\out.pdf

pdftr.exe -searchandoverlaytext "My Name=>Your Name" D:\in.pdf D:\out.pdf

pdftr.exe -searchandoverlaytext "My Name=>D:\temp\myname.png*20*20" D:\in.pdf D:\out.pdf

pdftr.exe -pagerange 1-3 -contentreplace "Old Text=>New Text||VeryPDF=>VeryDOC||My Name=>Your Name" D:\in.pdf D:\out.pdf

pdftr.exe -searchtext "string" C:\in.pdf

pdftr.exe -pagerange 1 -searchtext "string" C:\in.pdf

pdftr.exe -pagerange 1 -searchandoverlaytext "Old Text=>New Text||VeryPDF=>VeryDOC||My Name=>Your Name" D:\in.pdf D:\out.pdf

pdftr.exe -overlaytextfontname "Arial" -overlaytextcolor FF0000 -overlaybgcolor 00FF00 -searchandoverlaytext "Old Text=>New Text||VeryPDF=>VeryDOC||My Name=>Your Name" D:\in.pdf D:\out.pdf

pdftr.exe -opw 123 -upw 456 -contentreplace "Old Text=>New Text||VeryPDF=>VeryDOC||My Name=>Your Name" D:\in.pdf D:\out.pdf

pdftr.exe -searchandoverlaytext "PDFcamp Printer=>VeryPDF Printer" -overlaytextfontsize 8 D:\in.pdf D:\out.pdf

pdftr.exe -searchandoverlaytext "PDFcamp Printer=>VeryPDF Printer" -overlaytextfontsize 80% D:\in.pdf D:\out.pdf

Upvotes: -1

Dimitar

Reputation: 927

The question is for a programmatic solution, but I will still share this free online tool which helped me mass replace text in some PDF files:

http://www.pdfdu.com/pdf-replace-text.aspx

I did not notice any ads or other modifications in the resulting PDF files after replacing the text.

I was not able to make the changes locally with the software I tried. I think the main problem was that I was missing the font used in the PDF and it did not work properly, even with Acrobat Pro. The online tool did not complain and produced a great result.

Upvotes: 0

d-b

Reputation: 971

This is just half a solution, but I used Touch up combined with AppleScript's support for sending keystrokes to replace a string in thousands of table cells. Depending on how your pages are laid out, it could work for you. In my case I had to manually place the cursor at the beginning of every table (tens of tables, quite manageable for a manual process), but after that I replaced thousands of cells automatically.

Upvotes: 1

davr

Reputation: 19137

You can use the 'redaction' feature in Adobe Acrobat Pro to find and replace all references in a single document in one step. I'm not sure whether it can be automated to multiple steps.

http://help.adobe.com/en_US/Acrobat/9.0/Professional/WS5E28D332-9FF7-4569-AFAD-79AD60092D4D.w.html

Upvotes: 1

sobusola

Reputation: 7

I just finished trying out Infix on a text laden with diacritics, hoping to generate another text in which characters with double and composed diacritics are replaced by alternates with single diacritics. Infix is definitely a good solution for someone who does not want the trouble of understanding how programmatic solutions work. All the requested changes were made. I still need to work out how to reflow words when a replacement changes the layout of the text.

Upvotes: -1
