JStroop

Reputation: 475

High-res images from PDFS

I'm working on a project in which I need to extract a TIFF per page from multi-page PDFs. The PDFs contain images only and there is one image per page (I believe they were made on some kind of photocopier/scanner, but haven't confirmed this). The TIFFs are then used to create several other derivative versions of the document so the higher the resolution the better.

I've found two recipes, both with helpful aspects, but neither is ideal. Hoping someone can help me tune one of them, or offer a third option.

Recipe 1, pdfimages and ImageMagick:

First do:

$ pdfimages $MY_PDF.pdf foo

This results in several .pbm files (foo-000.pbm, foo-001.pbm, etc.).

Then for each *.pbm do:

$ convert $each -resize 3200x3200\> -quality 100 $new_name.tif

Pro: The resultant TIFFs are a healthy 3300+ pixels on the long dimension (-resize just serves to normalize everything).

Con: The orientation of the pages is lost, and they come out rotated in different directions (they follow logical patterns, so they are probably in the orientation in which they were fed to the scanner?).
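For reference, the two steps of this recipe can be combined into one small function (just a sketch; the function name and file paths are placeholders, and it assumes pdfimages from poppler-utils and ImageMagick's convert are installed):

```shell
#!/bin/bash
# Sketch of Recipe 1: extract the embedded images, then normalize their size.
extract_and_resize() {
    local pdf="$1"
    local prefix="${pdf%.pdf}"

    pdfimages "$pdf" "$prefix"      # writes $prefix-000.pbm, $prefix-001.pbm, ...

    local pbm
    for pbm in "$prefix"-*.pbm; do
        [ -e "$pbm" ] || continue   # skip if the glob matched nothing
        # The '>' flag means: only shrink images larger than 3200px; never enlarge.
        convert "$pbm" -resize '3200x3200>' -quality 100 "${pbm%.pbm}.tif"
    done
}
```

Single-quoting the -resize geometry avoids having to backslash-escape the > for the shell.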

Recipe 2, ImageMagick solo:

convert +adjoin $MY_PDF.pdf pages.tif

This gives me a TIFF per page (pages-0.tif, pages-1.tif, etc.).

Pro: Orientation stays!

Con: The long dimension of the resultant file is < 800 px, which is too small to be useful, and it looks as though there is some compression applied.

How can I ditch the scaling of the image stream in the PDF, but retain the orientation? Is there some more magick in ImageMagick that I'm missing? Something else entirely?

Upvotes: 2

Views: 3119

Answers (2)

Betagan

Reputation: 121

Sorry for the noise on this old topic, but Google took me here as one of the top results and it might take others, so I thought I'd post the solution to the OP's question that I found here: http://robfelty.com/2008/03/11/convert-pdf-to-png-with-imagemagick

In short: you have to tell ImageMagick at what density it should rasterize the PDF.

So convert -density 600x600 foo.pdf foo.png will tell ImageMagick to treat the PDF as if it had a resolution of 600 dpi and thus output much larger PNGs. In my case, the resulting foo.png was 5000x6600 px. You can optionally add -resize 3000x3000 (or whatever size you require) and it will be scaled down.
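Applied to the asker's TIFF-per-page workflow, the same idea might look like this (a sketch; the input filename and the 600 dpi / 3000 px figures are just examples, guarded so the command only runs when the tool and file are actually present):

```shell
#!/bin/bash
# Sketch: rasterize each PDF page at high density, then cap the output size.
pdf="scan.pdf"   # hypothetical input file
if command -v convert >/dev/null && [ -f "$pdf" ]; then
    # -density must come BEFORE the input file to affect rasterization;
    # %d in the output name yields pages-0.tif, pages-1.tif, ...
    convert -density 600 "$pdf" -resize '3000x3000>' +adjoin pages-%d.tif
fi
```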

Note that as long as your PDF contains only vector graphics or text, the density can be set as high as needed. If the PDF contains rasterized images, setting the density higher than those images' dpi won't make them look any better, surprise! :)

Chris

Upvotes: 2

JStroop

Reputation: 475

I wanted to share my solution...it may not work for everyone, but since nothing else has come along, maybe it will help someone. I wound up going with the first option from my question: using pdfimages to get large images that were rotated every which way. I then found a way to use OCR and word counts to guess the orientation, which took me from (an estimated) 25% of pages rotated correctly to above 90%.

The flow is as follows:

  1. Use pdfimages (apt-get install poppler-utils) to get a set of pbm files (not shown below).
  2. For each file:
    1. Make four versions, rotated 0, 90, 180, and 270 degrees (I refer to them as "north", "east", "south", and "west" in my code).
    2. OCR each. The two with the lowest word count are likely the right-side up and upside down versions. This was over 99% accurate in my set of images processed to date.
    3. From the two with the lowest word count, run the OCR output through a spell check. The file with the fewest spelling errors (i.e. the most recognizable words) is likely to be correct. For my set this was about 93% accurate (up from 25%), based on a sample of 500.

YMMV. My files are bitonal and highly textual; the source images average 3300 px on the long side. I can't speak to greyscale or color, or to files with a lot of images. Most of my source PDFs are bad scans of old photocopies, so the accuracy might be even better with cleaner files. Using -despeckle during the rotation made no difference and slowed things down considerably (~5×). I chose ocrad for speed rather than accuracy, since I only need rough numbers and am throwing away the OCR output anyway. Re: performance, my nothing-special Linux desktop can run the whole script at about 2-3 files per second.

Here's the implementation in a simple bash script:

#!/bin/bash
# Rotates a pbm file in place.

# Pass a .pbm as the only arg.
file="$1"

TMP="/tmp/rotation-calc"
mkdir -p "$TMP"

# Dependencies:                                                                 
# convert: apt-get install imagemagick                                          
# ocrad: sudo apt-get install ocrad                                               
ASPELL="/usr/bin/aspell"
AWK="/usr/bin/awk"
BASENAME="/usr/bin/basename"
CONVERT="/usr/bin/convert"
HEAD="/usr/bin/head"
OCRAD="/usr/bin/ocrad"
SORT="/usr/bin/sort"
WC="/usr/bin/wc"

# Make copies in all four orientations (the src file is north; copy it to make 
# things less confusing)
file_name=$($BASENAME "$file")
north_file="$TMP/$file_name-north"
east_file="$TMP/$file_name-east"
south_file="$TMP/$file_name-south"
west_file="$TMP/$file_name-west"

cp "$file" "$north_file"
$CONVERT -rotate 90 "$file" "$east_file"
$CONVERT -rotate 180 "$file" "$south_file"
$CONVERT -rotate 270 "$file" "$west_file"

# OCR each (just append ".txt" to the path/name of the image)
north_text="$north_file.txt"
east_text="$east_file.txt"
south_text="$south_file.txt"
west_text="$west_file.txt"

$OCRAD -f -F utf8 "$north_file" -o "$north_text"
$OCRAD -f -F utf8 "$east_file" -o "$east_text"
$OCRAD -f -F utf8 "$south_file" -o "$south_text"
$OCRAD -f -F utf8 "$west_file" -o "$west_text"

# Get the word count for each txt file (least 'words' == least whitespace junk
# resulting from vertical lines of text that should be horizontal.)
wc_table="$TMP/wc_table"
echo "$($WC -w $north_text) $north_file" > $wc_table
echo "$($WC -w $east_text) $east_file" >> $wc_table
echo "$($WC -w $south_text) $south_file" >> $wc_table
echo "$($WC -w $west_text) $west_file" >> $wc_table

# Take the bottom two; these are likely right side up and upside down, but 
# generally too close to call beyond that.
bottom_two_wc_table="$TMP/bottom_two_wc_table"
$SORT -n "$wc_table" | $HEAD -2 > "$bottom_two_wc_table"

# Spellcheck. The lowest number of misspelled words is most likely the 
# correct orientation.
misspelled_words_table="$TMP/misspelled_words_table"
while read -r record; do
    txt=$(echo "$record" | $AWK '{ print $2 }')
    misspelled_word_count=$($ASPELL -l en list < "$txt" | $WC -w)
    echo "$misspelled_word_count $record" >> "$misspelled_words_table"
done < "$bottom_two_wc_table"

# Sort by misspelled-word count; the best-scoring orientation wins and
# replaces the input file below.
winner=$($SORT -n "$misspelled_words_table" | $HEAD -1)
rotated_file=$(echo "$winner" | $AWK '{ print $4 }')

mv "$rotated_file" "$file"

# Clean up.
if [ -d "$TMP" ]; then
    rm -r "$TMP"
fi
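The script above processes a single file in place; a driver to run it over every page extracted by pdfimages might look like the following (a sketch; the script filename and the foo- prefix are hypothetical):

```shell
#!/bin/bash
# Hypothetical driver: fix the orientation of every extracted page in place.
# Assumes the rotation script above is saved as ./fix-rotation.sh and the
# pbm files were produced by `pdfimages my.pdf foo`.
for pbm in foo-*.pbm; do
    [ -e "$pbm" ] || continue   # skip if the glob matched nothing
    ./fix-rotation.sh "$pbm"
done
```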

Upvotes: 2
