How can I take a pdf, and convert any jpeg2000/jpx/jp2 images in it to jpeg images?

Question

I am using MacOS Mojave on a Mac Mini, and I am also using an old Kindle Dx which cannot read jpeg2000 images. It also has trouble with too many or too large jpeg images.

I cannot use touchscreens, so newer e-readers and tablets aren't a solution.

So far, I've found some buggy solutions--

I can use Willus's k2pdfopt with -mode copy and -dev dx, which rasterizes everything. It's a good solution for scanned pdfs. If more detail is needed, -mode copy without -dev dx will preserve higher resolution. It's something of a last resort for pdf-born-pdfs, since text can be uglier and harder to read, and file sizes can increase alarmingly.

I can also use Ghostscript with -dCompatibilityLevel=1.4, which doesn't rasterize everything. It converts jpeg2000 images to jpeg images. But it doesn't tackle some oversized or poorly-constructed images, it often creates dark rectangles which can obscure text, and it occasionally loses the ability to search or select text. [P.S. I mean it takes a pdf which had searchable pdf and outputs one which does not. Also if I do any kind of image downsampling or removal, it sometimes rescales everything or loses pages.]

I have experimented with options to compress images in Ghostscript, with mixed success, and with the above bugs persisting. [P.S. I think I was downsampling, yes.]

For whatever reason, MacOS Quartz filters only work if they will reduce image sizes. So they tend not to work on the buggy images.

Now my ideal solution would preserve the text itself, preferably untangling ligatures, and would compress the images like Willus's k2pdfopt. But I have no idea if that's possible or how.

Short of that-- I'm wondering if there's a way to use Ghostscript to convert the jpeg2000 images without causing the gray rectangles or losing the ability to search or select text.

or if there's a way to use Quartz filters so they work. In some older versions of MacOS they did work.

or if there's a way to batch-print these pdf files to the appropriate resolution, apparently 800x1180, reprocessing images in the process.

I don't have much programming experience. I mainly use homebrew to install command-line tools, very sloppy bash scripts, and Automator to run them.

P.S. For a minimal example of the gray rectangles in Ghostscript, using the free pdf from here: https://www.peginc.com/store/test-drive-savage-worlds-the-wild-hunt/

gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -o out.pdf in.pdf

substituting that pdf for in.pdf.

For a minimal example of losing searchable text, using the free pdf from here: http://datafortress2020.com/fileproject/details.php?image_id=498

same minimal script

Compatibility Level

gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dCompatibilityLevel=1.4 -o out.pdf in.pdf

Aggressive Downsampling and Grayscale

gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dCompatibilityLevel=1.4 -g800x1080 -r150 -dPDFFitPage \ -dFastWebView -sColorConversionStrategy=Gray \ -dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true -dColorImageResolution=75 -dGrayImageResolution=75 -dMonoImageResolution=150 -dColorImageDownsampleThreshold=1.0 -dGrayImageDownsampleThreshold=1.0 -dMonoImageDownsampleThreshold=1.0 \ -o out.pdf in.pdf

P.P.S. I can use k2pdfopt to rasterize to fit my Kindle. If the file has searchable text, this retains it, if it doesn't I can run tesseract in k2 or run ocrmypdf afterwards.

But if I want especially good graphics, or especially clear text, and the file has hundreds of pages, it will need hundreds of megs. I had blamed this on rasterizing the text, which was why my ideal solution was to keep text and rasterize images, but apparently it's an issue with the images themselves.

How can I take a pdf, and convert any jpeg2000/jpx/jp2 images in it to jpeg images?

Answers (1)

Edit

Related Questions