Reputation: 3531

How to write shell script for finding number of pages in PDF?

I am generating a PDF dynamically. How can I check the number of pages in the PDF using a shell script?

Upvotes: 54

Answers (15)

K J

Reputation: 11939

Word of CAUTION

Most PDF parsers do not report the correct page count for every kind of PDF. For a PDF collection (portfolio) they often report only the surface cover page, so Pages = 1, and for an XFA form they often cannot determine the real number of pages and likewise report 1.

Here is one example, without listing all the apps that fail:

> PDFexecutable -info ..\document-portfolio.pdf | findstr /i "pages:"  

Pages: 1

Ghostscript, by contrast, can query all the attachments. Here the collection holds one cover PDF and 4 other files (a total of 150 PDF pages in one PDF):

>bin\gs -q -dBATCH -dPDFINFO ..\..\document-portfolio.pdf 2> out.txt & findstr  "has" out.txt

        File has 1 page.
        File has 17 pages
        File has 2 pages
        File has 19 pages
        File has 111 pages
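
To get a single total from a report like the one above, the per-file lines can be summed. This is only a sketch: `sum_gs_pages` is a hypothetical helper name, and it assumes each entry is reported on a line of the form "File has N page(s)" exactly as in the output shown.

```shell
# Hypothetical helper: read a Ghostscript -dPDFINFO report on stdin and
# sum the numbers that follow the word "has" on "File has N page(s)" lines.
sum_gs_pages() {
    awk '/File has [0-9]+ page/ {
             for (i = 1; i < NF; i++)
                 if ($i == "has") { total += $(i + 1); break }
         }
         END { print total + 0 }'
}

# Usage sketch (assumes the report is written to stderr, as in the run above):
#   gs -q -dBATCH -dPDFINFO document-portfolio.pdf 2>&1 | sum_gs_pages
```

For the five lines shown above this yields 150, matching the stated total.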

Upvotes: 0

symmetry

Reputation: 529

Another mutool solution, this time using mutool run:

Create a file myscript.js containing:

var doc = Document.openDocument(scriptArgs[0]);
var n = doc.countPages();

print(n, "pages");

To run it:

mutool run myscript.js mypdf.pdf

Upvotes: 0

Gruber

Reputation: 2308

A super quick but effective alternative is the great exiftool program.

exiftool -FileName -PageCount -T file.pdf

For example, if file.pdf has 5 pages, the output will be:

file.pdf    5

Extra bonus: create a text file listing every PDF in the current directory with its page count:

exiftool -FileName -PageCount -T -ext pdf . > report.txt

To scan subfolders recursively, add the -r flag:

exiftool -FileName -PageCount -T -r -ext pdf . > report.txt
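
Because the report is tab-separated (file name, then page count), a grand total is easy to derive from it. A minimal sketch, assuming the two-column layout shown above; `sum_report_pages` is a hypothetical helper name, and rows whose second column is not a plain number (exiftool -T prints "-" when a tag is missing) are skipped.

```shell
# Hypothetical helper: sum the second tab-separated column of a report
# read from stdin, ignoring rows where that column is not a plain integer.
sum_report_pages() {
    awk -F '\t' '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Usage sketch:
#   sum_report_pages < report.txt
```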

Upvotes: 0

foolishgrunt

Reputation: 103

QPDF offers the most straightforward method I'm aware of.

qpdf --show-npages input.pdf

Upvotes: 2

cotrane

Reputation: 189

Another mutool solution making better use of the options:

mutool show file.pdf Root/Pages/Count

Upvotes: 3

Gabriel Staples

Reputation: 53175

Here is a total hack using pdftoppm, which comes preinstalled on Ubuntu (tested on Ubuntu 18.04 and 20.04 at least):

# for a pdf withOUT a password
pdftoppm mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'

# for a pdf WITH a password which is `1234`
pdftoppm -upw 1234 mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'

How does this work? Well, if you specify a first page which is larger than the pages in the PDF (I specify page number 1000000, which is too large for all known PDFs), it will print the following error to stderr:

Wrong page range given: the first page (1000000) can not be after the last page (142).

So I redirect that stderr message to stdout with 2>&1, pipe it to grep to match the (142). part with the regular expression `([0-9]*)\.$`, then pipe that to grep again with the regular expression `[0-9]*` to extract just the number, which is 142 in this case. That's it!
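
As a variation on the two-grep pipeline, the same extraction can be done with a single sed substitution. This is a sketch rather than part of the original technique: `parse_last_page` is a hypothetical helper name, and it assumes the error message ends with "(N)." as in the example above.

```shell
# Hypothetical helper: read pdftoppm's error text on stdin and print the
# number inside the trailing "(N)." of the line; print nothing if no
# line has that shape.
parse_last_page() {
    sed -n 's/.*(\([0-9][0-9]*\))\.$/\1/p'
}

# Usage sketch (assumes pdftoppm is installed):
#   pdftoppm mypdf.pdf -f 1000000 2>&1 | parse_last_page
```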

Wrapper functions and speed testing

Here are a couple of wrapper functions to test these:

# get the total number of pages in a PDF; technique 1.
# See this ans here: https://stackoverflow.com/a/14736593/4561887
# Usage (works on ALL PDFs--whether password-protected or not!):
#       num_pgs="$(getNumPgsInPdf "path/to/mypdf.pdf")"
# SUPER SLOW! Putting `time` just in front of the `strings` cmd shows it takes ~0.200 sec on a 142
# pg PDF!
getNumPgsInPdf() {
    _pdf="$1"

    _num_pgs="$(strings < "$_pdf" | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
        | sort -rn | head -n 1)"

    echo "$_num_pgs"
}

# get the total number of pages in a PDF; technique 2.
# See my ans here: https://stackoverflow.com/a/66963293/4561887
# Usage, where `pw` is some password, if the PDF is password-protected (leave this off for PDFs
# with no password):
#       num_pgs="$(getNumPgsInPdf2 "path/to/mypdf.pdf" "pw")"
# SUPER FAST! Putting `time` just in front of the `pdftoppm` cmd shows it takes ~0.020 sec OR LESS
# on a 142 pg PDF!
getNumPgsInPdf2() {
    _pdf="$1"
    _password="$2"

    if [ -n "$_password" ]; then
        _password="-upw $_password"
    fi

    _num_pgs="$(pdftoppm $_password "$_pdf" -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
        | grep -o '[0-9]*')"

    echo "$_num_pgs"
}

Testing them with the time command in front shows that the strings one is extremely slow, taking ~0.200 sec on a 142 pg pdf, whereas the pdftoppm one is very fast, taking ~0.020 sec or less on the same pdf. The pdfinfo technique in Ocaso's answer below is also very fast--the same as the pdftoppm one.
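
For comparison, the pdfinfo technique comes down to reading the "Pages:" line of its output. A minimal sketch, assuming pdfinfo (from poppler-utils) is installed; `parse_pdfinfo_pages` and `getNumPgsInPdf3` are hypothetical names, not taken from any answer here.

```shell
# Hypothetical helper: read pdfinfo output on stdin and print the value
# of the first line whose first field is exactly "Pages:".
parse_pdfinfo_pages() {
    awk '$1 == "Pages:" { print $2; exit }'
}

# Hypothetical wrapper; assumes pdfinfo is on PATH.
# Usage: num_pgs="$(getNumPgsInPdf3 "path/to/mypdf.pdf")"
getNumPgsInPdf3() {
    pdfinfo "$1" 2>/dev/null | parse_pdfinfo_pages
}
```

Matching the whole first field avoids picking up other lines such as "Page size:".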

#!/usr/bin/env ksh
# Usage: pdfcount.sh file.pdf
#
# Optimally, this would be a mere:
#       pdfinfo file.pdf | grep Pages | sed 's/[^0-9]*//'

[[ "$#" != "1" ]] && {
   printf "ERROR: No file specified\n"
   exit 1
}

numpages=0
while read line; do
   num=${line/*([[:print:]])+(Count )?(-)+({1,4}(\d))*([[:print:]])/\4}
   (( num > numpages)) && numpages=$num
done < <(strings "$@" | grep "/Count")
print $numpages

Upvotes: 1

Lacobus

Reputation: 1658

The pdftotext utility converts a PDF file to plain text, inserting page breaks (form-feed characters, $'\f') between the pages:

NAME
       pdftotext - Portable Document Format (PDF) to text converter.

SYNOPSIS
       pdftotext [options] [PDF-file [text-file]]

DESCRIPTION
       Pdftotext converts Portable Document Format (PDF) files to plain text.

       Pdftotext  reads  the PDF file, PDF-file, and writes a text file, text-file.  If text-file is
       not specified, pdftotext converts file.pdf to file.txt.  If text-file is  '-',  the  text  is
       sent to stdout.

There are many ways to combine it with other tools to solve your problem; choose one of them:

1) pdftotext + grep:

$ pdftotext file.pdf - | grep -c $'\f'

2) pdftotext + awk (v1):

$ pdftotext file.pdf - | awk 'BEGIN{n=0} {if(index($0,"\f")){n++}} END{print n}'

3) pdftotext + awk (v2):

$ pdftotext file.pdf - | awk 'BEGIN{ RS="\f" } END{ print NR }'

4) pdftotext + awk (v3):

$ pdftotext file.pdf - | awk -v RS="\f" 'END{ print NR }'
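
One more variation, offered as a sketch: grep -c counts lines that contain a form feed, so two page breaks that land on the same line (for example around an empty page) are counted once. Counting the form-feed characters themselves avoids that. `count_form_feeds` is a hypothetical helper name, and the result still depends on pdftotext emitting one form feed per page.

```shell
# Hypothetical helper: count form-feed characters on stdin by deleting
# every other byte and counting what is left.
count_form_feeds() {
    tr -cd '\f' | wc -c | tr -d ' '
}

# Usage sketch (assumes pdftotext is installed):
#   pdftotext file.pdf - | count_form_feeds
```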

Hope it Helps!

Upvotes: 8

np0x

Reputation: 166

ImageMagick provides a tool called identify which, combined with counting its lines of output, gets you what you are after. ImageMagick is an easy install on macOS with brew.

Here is a functional bash script that captures the count in a shell variable and prints it back to the screen:

#!/bin/bash
pdfFile=$1
echo "Processing $pdfFile"
# identify prints info for each page; send stderr to /dev/null,
# count the lines of output,
# and trim the whitespace from the wc -l output
numberOfPages=$(/usr/local/bin/identify "$pdfFile" 2>/dev/null | wc -l | tr -d ' ')
echo "The number of pages is: $numberOfPages"

And the output of running it...

$ ./countPages.sh aSampleFile.pdf 
Processing aSampleFile.pdf
The number of pages is: 2
$

Upvotes: 9
