Manish

Reputation: 3521

How to write shell script for finding number of pages in PDF?

I am generating a PDF dynamically. How can I check the number of pages in the PDF using a shell script?

Upvotes: 53

Views: 29696

Answers (15)

K J

Reputation: 11737

Word of CAUTION

Most PDF parsers do not report the correct page count for every kind of PDF. For a PDF collection (portfolio) they often report only the surface cover page, so Pages = 1. For an XFA form they often cannot determine the number of XFA pages and likewise report Pages = 1.

Here is one example, without listing every application that fails:

> PDFexecutable -info ..\document-portfolio.pdf | findstr /i "pages:"  

Pages: 1

Ghostscript, by contrast, can query all the attachments. Here the collection holds one cover PDF and 4 other files (a total of 150 PDF pages in one PDF):

>bin\gs -q -dBATCH -dPDFINFO ..\..\document-portfolio.pdf 2> out.txt & findstr  "has" out.txt

        File has 1 page.
        File has 17 pages
        File has 2 pages
        File has 19 pages
        File has 111 pages
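
The per-file lines can be added up to get the total for the collection. This is a minimal sketch, assuming the output keeps the "File has N page(s)" shape shown above; the function name sum_gs_pages is made up for illustration.

```shell
# Hypothetical helper: sum the page counts from lines such as
# "File has 17 pages" read on stdin (line format assumed from the output above).
sum_gs_pages() {
    awk '$1 == "File" && $2 == "has" { total += $3 } END { print total + 0 }'
}

# Example using the five lines shown above; no Ghostscript needed.
printf '%s\n' 'File has 1 page.' 'File has 17 pages' 'File has 2 pages' \
    'File has 19 pages' 'File has 111 pages' | sum_gs_pages
```

With the out.txt file written by the Ghostscript command above, the call would be `sum_gs_pages < out.txt`.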


Upvotes: 0

symmetry

Reputation: 529

Another mutool solution, this time using mutool run:

Create a file myscript.js containing:

var doc = Document.openDocument(scriptArgs[0]);
var n = doc.countPages();

print(n, "pages");

To run,

mutool run myscript.js mypdf.pdf

Upvotes: 0

Gruber

Reputation: 2298

A super quick but effective alternative is the great exiftool program.

exiftool -FileName -PageCount -T file.pdf

For example, if file.pdf has 5 pages, the output will be:

file.pdf    5

Extra bonus: create a text file listing every PDF file in the current directory with its page count:

exiftool -FileName -PageCount -T -ext pdf . > report.txt

Add the -r flag to scan subfolders recursively:

exiftool -FileName -PageCount -T -r -ext pdf . > report.txt

Upvotes: 0

foolishgrunt

Reputation: 103

QPDF offers the most straightforward method I'm aware of.

qpdf --show-npages input.pdf

Upvotes: 2

cotrane

Reputation: 189

Another mutool solution making better use of the options:

mutool show file.pdf Root/Pages/Count

Upvotes: 3

Gabriel Staples

Reputation: 52449

Here is a total hack using pdftoppm, which comes preinstalled on Ubuntu (tested on Ubuntu 18.04 and 20.04 at least):

# for a pdf withOUT a password
pdftoppm mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'

# for a pdf WITH a password which is `1234`
pdftoppm -upw 1234 mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'

How does this work? Well, if you specify a first page which is larger than the pages in the PDF (I specify page number 1000000, which is too large for all known PDFs), it will print the following error to stderr:

Wrong page range given: the first page (1000000) can not be after the last page (142).

So, I pipe that stderr msg to stdout with 2>&1, as explained here, then I pipe that to grep to match the (142). part with this regular expression (([0-9]*)\.$), then I pipe that to grep again with this regular expression ([0-9]*) to find just the number, which is 142 in this case. That's it!
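
The extraction step can be tried without any PDF. This is a small sketch, assuming the error message has exactly the wording quoted above (other pdftoppm versions may phrase it differently). The name last_page_from_error is invented here, and a single sed call stands in for the two grep calls.

```shell
# Hypothetical helper: read pdftoppm's error text on stdin and print the
# number inside the trailing "(N)." (message wording assumed from above).
last_page_from_error() {
    sed -n 's/.*(\([0-9][0-9]*\))\.$/\1/p'
}

# Feed it the sample message quoted above.
msg='Wrong page range given: the first page (1000000) can not be after the last page (142).'
printf '%s\n' "$msg" | last_page_from_error
```

If the message does not end in "(N).", the helper prints nothing, which a calling script can treat as "count unknown".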

Wrapper functions and speed testing

Here are a couple of wrapper functions to test these:

# get the total number of pages in a PDF; technique 1.
# See this ans here: https://stackoverflow.com/a/14736593/4561887
# Usage (works on ALL PDFs--whether password-protected or not!):
#       num_pgs="$(getNumPgsInPdf "path/to/mypdf.pdf")"
# SUPER SLOW! Putting `time` just in front of the `strings` cmd shows it takes ~0.200 sec on a 142
# pg PDF!
getNumPgsInPdf() {
    _pdf="$1"

    _num_pgs="$(strings < "$_pdf" | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
        | sort -rn | head -n 1)"

    echo "$_num_pgs"
}

# get the total number of pages in a PDF; technique 2.
# See my ans here: https://stackoverflow.com/a/66963293/4561887
# Usage, where `pw` is some password, if the PDF is password-protected (leave this off for PDFs
# with no password):
#       num_pgs="$(getNumPgsInPdf2 "path/to/mypdf.pdf" "pw")"
# SUPER FAST! Putting `time` just in front of the `pdftoppm` cmd shows it takes ~0.020 sec OR LESS
# on a 142 pg PDF!
getNumPgsInPdf2() {
    _pdf="$1"
    _password="$2"

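    # Keep the password as its own argument so that spaces or glob characters
    # in it are not split or expanded; "$@" is local to this function call.
    if [ -n "$_password" ]; then
        set -- -upw "$_password"
    else
        set --
    fi

    _num_pgs="$(pdftoppm "$@" "$_pdf" -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
        | grep -o '[0-9]*')"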

    echo "$_num_pgs"
}

Testing them with the time command in front shows that the strings one is extremely slow, taking ~0.200 sec on a 142 pg pdf, whereas the pdftoppm one is very fast, taking ~0.020 sec or less on the same pdf. The pdfinfo technique in Ocaso's answer below is also very fast--the same as the pdftoppm one.

See also

  1. These awesome answers by Ocaso Protal.
  2. These functions above will be used in my pdf2searchablepdf project here: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.

Upvotes: 5

Ocaso Protal

Reputation: 20237

Without any extra package:

strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
    | sort -rn | head -n 1

Using pdfinfo:

pdfinfo file.pdf | awk '/^Pages:/ {print $2}'

Using pdftk:

pdftk file.pdf dump_data | grep NumberOfPages | awk '{print $2}'

You can also recursively sum the total number of pages in all PDFs via pdfinfo as follows:

find . -xdev -type f -name "*.pdf" -exec pdfinfo "{}" ";" | \
    awk '/^Pages:/ {n += $2} END {print n}'
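
Since the question is about checking a freshly generated PDF from a script, it may help to fail loudly when no count can be read. This is a sketch built on the pdfinfo variant above; pages_from_pdfinfo and pdf_page_count are hypothetical names, and the sample input only imitates pdfinfo's "Pages:" line.

```shell
# Hypothetical helper: print the value of the "Pages:" line read on stdin.
# Returns non-zero if no such line is present.
pages_from_pdfinfo() {
    awk '/^Pages:/ { print $2; found = 1; exit } END { if (!found) exit 1 }'
}

# Hypothetical wrapper: print the page count of the PDF given as $1,
# or return non-zero if pdfinfo is missing or reports no page count.
pdf_page_count() {
    command -v pdfinfo >/dev/null 2>&1 || { echo "pdfinfo not found" >&2; return 127; }
    pdfinfo "$1" 2>/dev/null | pages_from_pdfinfo
}

# Parsing demo on imitation pdfinfo output; no PDF needed.
printf 'Title:          demo\nPages:          5\nPage size:      595 x 842 pts\n' \
    | pages_from_pdfinfo
```

A generating script could then write something like `n="$(pdf_page_count out.pdf)" || exit 1` and compare "$n" with the expected count.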

Upvotes: 86

user2616155

Reputation: 348

To build on Marius Hofert's answer, this command uses a bash for loop to show you the number of pages, display the filename, and it will ignore the case of the file extension.

for f in *.[pP][dD][fF]; do pdfinfo "$f" | grep Pages | awk '{printf $2 }'; echo " $f"; done

Upvotes: -1

Farid Cheraghi

Reputation: 747

mupdf/mutool solution:

mutool info tmp.pdf | grep '^Pages' | cut -d ' ' -f 2

Upvotes: 3

Gerrit Griebel

Reputation: 453

If you're on macOS you can query pdf metadata like this:

mdls -name kMDItemNumberOfPages -raw file.pdf

as seen here https://apple.stackexchange.com/questions/225175/get-number-of-pdf-pages-in-terminal

Upvotes: 3

Leonardo Sapiras

Reputation: 19

I made a small improvement to Marius Hofert's tip so that it sums the returned values.

for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done | awk '{s+=$1}END{print s}'

Upvotes: 0

mathlete

Reputation: 6682

Here is a version for the command line directly (based on pdfinfo):

for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done

Upvotes: 8

ikaerom

Reputation: 727

Just dug out an old script (in ksh) I found:

#!/usr/bin/env ksh
# Usage: pdfcount.sh file.pdf
#
# Optimally, this would be a mere:
#       pdfinfo file.pdf | grep Pages | sed 's/[^0-9]*//'

[[ "$#" != "1" ]] && {
   printf "ERROR: No file specified\n"
   exit 1
}

numpages=0
while read line; do
   num=${line/*([[:print:]])+(Count )?(-)+({1,4}(\d))*([[:print:]])/\4}
   (( num > numpages)) && numpages=$num
done < <(strings "$@" | grep "/Count")
print $numpages

Upvotes: 1

Lacobus

Reputation: 1658

The pdftotext utility converts a PDF file to text, inserting page breaks (form-feed characters, $'\f') between the pages:

NAME
       pdftotext - Portable Document Format (PDF) to text converter.

SYNOPSIS
       pdftotext [options] [PDF-file [text-file]]

DESCRIPTION
       Pdftotext converts Portable Document Format (PDF) files to plain text.

       Pdftotext  reads  the PDF file, PDF-file, and writes a text file, text-file.  If text-file is
       not specified, pdftotext converts file.pdf to file.txt.  If text-file is  '-',  the  text  is
       sent to stdout.

There are several ways to combine it with other tools to solve your problem; choose one of them:

1) pdftotext + grep:

$ pdftotext file.pdf - | grep -c $'\f'

2) pdftotext + awk (v1):

$ pdftotext file.pdf - | awk 'BEGIN{n=0} {if(index($0,"\f")){n++}} END{print n}'

3) pdftotext + awk (v2):

$ pdftotext sample.pdf - | awk 'BEGIN{ RS="\f" } END{ print NR }'

4) pdftotext + awk (v3):

$ pdftotext sample.pdf - | awk -v RS="\f" 'END{ print NR }'
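
One caveat, offered as an assumption about edge cases rather than something tested against every PDF: grep -c counts matching lines, not occurrences, so two form feeds on the same line (for example from consecutive pages with no text) would be counted once. Here is a sketch that counts the form-feed bytes themselves; count_form_feeds is a made-up name.

```shell
# Hypothetical helper: count form-feed characters on stdin.
# tr keeps only the form feeds, wc counts them, and the last tr strips
# the padding some wc implementations add.
count_form_feeds() {
    tr -cd '\f' | wc -c | tr -d ' '
}

# Synthetic three-page input with two form feeds on one line.
printf 'page one\n\f\fpage three\n\f' | count_form_feeds
```

With a real file this would be used as `pdftotext file.pdf - | count_form_feeds`.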

Hope it Helps!

Upvotes: 8

np0x

Reputation: 166

The ImageMagick suite provides a tool called identify which, combined with counting the lines of its output, gets you what you are after. ImageMagick is an easy install on macOS with brew.

Here is a functional bash script that captures it to a shell variable and dumps it back to the screen...

#!/bin/bash
pdfFile=$1
echo "Processing $pdfFile"
numberOfPages=$(/usr/local/bin/identify "$pdfFile" 2>/dev/null | wc -l | tr -d ' ')
#Identify gets info for each page, dump stderr to dev null
#count the lines of output
#trim the whitespace from the wc -l output
echo "The number of pages is: $numberOfPages"

And the output of running it...

$ ./countPages.sh aSampleFile.pdf 
Processing aSampleFile.pdf
The number of pages is: 2
$ 

Upvotes: 9
