Reputation: 67
I'm looking for a way to remove/delete ALL text from a PDF. I've been using iTextSharp for a while now, and extracting text from a PDF with it is easy (without the use of OCR). However, I can't find an option to delete the text.
This solution frankly doesn't work for me.
page.GetAsArray(PdfName.CONTENTS); returns null for me, also when using PdfName.Text and some others I've tried.
The library to use doesn't really matter; I just think iTextSharp should be able to do this. However, if there is another (free) solution, bring it on.
EDIT: Just to make clear why I want to remove all text from the PDFs:
I want to reduce the size of the PDFs. I do this by reducing the resolution of the images in the PDF. However, in a lot of cases the vector images take up most of the space. So I thought of the following: remove all text, then convert the remaining PDF (with only the images and vectors) to a bitmap (JPEG). After that I paste the text over it again. Another option would be to make the text invisible, but I don't think that is any easier.
Upvotes: 2
Views: 5533
Reputation: 171
To remove all text from a PDF, the easiest solution is to use Ghostscript:
gs -o output_no_text.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf
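To check whether stripping the text actually changes the file size, the command above can be wrapped in a small script. This is only a sketch: the function names (size_of, strip_text, compare_sizes) are made up for illustration, and it assumes a Ghostscript build recent enough to support -dFILTERTEXT.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of Ghostscript.

# Print a file's size in bytes (wc -c on stdin avoids printing the file name).
size_of() {
    wc -c < "$1" | tr -d ' '
}

# Write a copy of $1 without text to $2.
# Assumes the installed gs supports -dFILTERTEXT.
strip_text() {
    gs -q -o "$2" -sDEVICE=pdfwrite -dFILTERTEXT "$1"
}

# Strip the text, then report the size before and after.
compare_sizes() {
    strip_text "$1" "$2" || return 1
    echo "before: $(size_of "$1") bytes, after: $(size_of "$2") bytes"
}
```

After sourcing the script, calling compare_sizes input.pdf output_no_text.pdf prints both sizes, which shows how much of the file the text and fonts really account for.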
Upvotes: 0
Reputation: 90193
Now that you've updated your question, and revealed the motivation of the intended measure, let me tell you the truth:
These measures will in no way reduce the size of PDFs.
Instead they'll lead to a hugely increased file:
First, removing text and fonts may shrink the file slightly, yes.
Then, converting what remains of the page to a bitmap will certainly increase the size hugely (unless you are willing to accept very low image quality).
Finally, 'pasting' the text over it again will increase the file size once more (very likely by the same amount you saved in the first step).
It's not a good plan at all.
If you provide (a link to) one of your typical sample PDF files, I can probably come up with a Ghostscript (plus other tools) command line that works out of the box and shrinks the PDF size more efficiently.
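As a generic starting point, and not the tuned command line a real sample file would call for, Ghostscript's pdfwrite device can downsample the embedded images directly. The sketch below assumes a target resolution of 150 dpi by default; the function name shrink_pdf and that default are illustrative choices. It leaves vector artwork untouched, so how much it helps depends on the file.

```shell
#!/bin/sh
# Hypothetical helper: rewrite $1 to $2 with images downsampled to $3 dpi.
# The name shrink_pdf and the default of 150 dpi are assumptions.
shrink_pdf() {
    res=${3:-150}

    # Reject anything that is not a whole number before calling gs.
    case "$res" in
        ''|*[!0-9]*)
            echo "resolution must be a whole number" >&2
            return 2
            ;;
    esac

    gs -q -o "$2" -sDEVICE=pdfwrite \
        -dDownsampleColorImages=true -dColorImageResolution="$res" \
        -dDownsampleGrayImages=true -dGrayImageResolution="$res" \
        -dDownsampleMonoImages=true -dMonoImageResolution="$res" \
        "$1"
}
```

For example, shrink_pdf input.pdf smaller.pdf 100 would write a copy with images reduced towards 100 dpi; comparing the two file sizes afterwards shows whether the images were the main cost.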
Upvotes: 2
Reputation: 77528
The /Contents entry of a page dictionary doesn't always consist of an array. It should be evident that GetAsArray() returns null if the content is stored as a stream. Even if you use GetAsStream() and remove all the text content from the stream, you may still have text content in XObjects. That text won't be referenced from a content stream, but iText won't be able to remove the XObjects as 'unused objects' because the objects will still be referenced from the /Resources in the page dictionary.
Please read ISO-32000-1 to find out what you're doing wrong.
Upvotes: 2