Reputation: 3521
I am generating a PDF dynamically. How can I check the number of pages in the PDF using a shell script?
Upvotes: 53
Views: 29696
Reputation: 11737
Word of CAUTION
Most PDF parsers do not report the correct number of pages for every kind of PDF. A PDF collection (portfolio) may be reported as just its surface cover page, so Pages = 1, and the number of XFA pages in a form is often not detected either, so those files are also shown as having just 1 page.
As an example, without showing all the apps that fail:
> PDFexecutable -info ..\document-portfolio.pdf | findstr /i "pages:"
Pages: 1
Ghostscript, by contrast, can query all the attachments. Here there is one cover PDF and 4 other files in the collection (a total of 150 PDF pages in one PDF):
>bin\gs -q -dBATCH -dPDFINFO ..\..\document-portfolio.pdf 2> out.txt & findstr "has" out.txt
File has 1 page.
File has 17 pages
File has 2 pages
File has 19 pages
File has 111 pages
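The per-file counts can be added up in one step. This is a minimal POSIX shell sketch, assuming the lines look exactly like the "File has N page(s)" lines shown above; the function name sum_gs_pages is made up for illustration and is not part of Ghostscript.

```shell
#!/bin/sh
# Hypothetical helper: sum the "File has N page(s)" lines read from stdin.
# Assumes the count is the third whitespace-separated field, as in the listing above.
sum_gs_pages() {
    awk '/^File has [0-9]+ page/ {total += $3} END {print total + 0}'
}

# Demonstration on the five lines from the listing above; prints 150.
sum_gs_pages <<'EOF'
File has 1 page.
File has 17 pages
File has 2 pages
File has 19 pages
File has 111 pages
EOF
```

On a Unix-like shell the captured file from the command above could be fed in the same way, for example sum_gs_pages < out.txt.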
Upvotes: 0
Reputation: 529
Another mutool solution, this time using mutool run:
Create a file myscript.js containing:
var doc = Document.openDocument(scriptArgs[0]);
var n = doc.countPages();
print(n, "pages");
To run it:
mutool run myscript.js mypdf.pdf
Upvotes: 0
Reputation: 2298
A super quick but effective alternative is the great exiftool program.
exiftool -FileName -PageCount -T file.pdf
For example, with file.pdf having 5 pages, the output will be:
file.pdf 5
Extra bonus: create a text file listing every PDF file in the current directory with its page count:
exiftool -FileName -PageCount -T -ext pdf . > report.txt
You can recursively scan subfolders with the -r flag:
exiftool -FileName -PageCount -T -r -ext pdf . > report.txt
Upvotes: 0
Reputation: 103
QPDF offers the most straightforward method I'm aware of.
qpdf --show-npages input.pdf
Upvotes: 2
Reputation: 189
Another mutool solution making better use of the options:
mutool show file.pdf Root/Pages/Count
Upvotes: 3
Reputation: 52449
Here is a total hack using pdftoppm, which comes preinstalled on Ubuntu (tested on Ubuntu 18.04 and 20.04 at least):
# for a pdf withOUT a password
pdftoppm mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'
# for a pdf WITH a password which is `1234`
pdftoppm -upw 1234 mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'
How does this work? Well, if you specify a first page (-f) which is larger than the number of pages in the PDF (I specify page number 1000000, which is too large for all known PDFs), it will print the following error to stderr:
Wrong page range given: the first page (1000000) can not be after the last page (142).
So, I redirect that stderr message to stdout with 2>&1, then I pipe that to grep to match the (142). part with the regular expression ([0-9]*)\.$, then I pipe that to grep again with the regular expression [0-9]* to find just the number, which is 142 in this case. That's it!
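To see the two grep stages in isolation, they can be run on the sample error message without involving pdftoppm at all. This is only a small sketch; the function name extract_last_page is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: pull the page count out of a pdftoppm
# "Wrong page range given" message read from stdin.
extract_last_page() {
    grep -o '([0-9]*)\.$' |   # stage 1: keep only the trailing "(142)." part
        grep -o '[0-9]*'      # stage 2: keep only the digits
}

# The sample message from above; prints 142.
msg='Wrong page range given: the first page (1000000) can not be after the last page (142).'
printf '%s\n' "$msg" | extract_last_page
```

The first pattern is anchored to the end of the line, which is why the earlier (1000000) in the same message is not picked up.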
Here are a couple wrapper functions to test these:
# get the total number of pages in a PDF; technique 1.
# See this ans here: https://stackoverflow.com/a/14736593/4561887
# Usage (works on ALL PDFs--whether password-protected or not!):
# num_pgs="$(getNumPgsInPdf "path/to/mypdf.pdf")"
# SUPER SLOW! Putting `time` just in front of the `strings` cmd shows it takes ~0.200 sec on a 142
# pg PDF!
getNumPgsInPdf() {
_pdf="$1"
_num_pgs="$(strings < "$_pdf" | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
| sort -rn | head -n 1)"
echo "$_num_pgs"
}
# get the total number of pages in a PDF; technique 2.
# See my ans here: https://stackoverflow.com/a/66963293/4561887
# Usage, where `pw` is some password, if the PDF is password-protected (leave this off for PDFs
# with no password):
# num_pgs="$(getNumPgsInPdf2 "path/to/mypdf.pdf" "pw")"
# SUPER FAST! Putting `time` just in front of the `pdftoppm` cmd shows it takes ~0.020 sec OR LESS
# on a 142 pg PDF!
getNumPgsInPdf2() {
_pdf="$1"
_password="$2"
if [ -n "$_password" ]; then
_password="-upw $_password"
fi
_num_pgs="$(pdftoppm $_password "$_pdf" -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*')"
echo "$_num_pgs"
}
Testing them with the time command in front shows that the strings one is extremely slow, taking ~0.200 sec on a 142 pg pdf, whereas the pdftoppm one is very fast, taking ~0.020 sec or less on the same pdf. The pdfinfo technique in Ocaso's answer below is also very fast, the same as the pdftoppm one.
pdf2searchablepdf project here: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.
Upvotes: 5
Reputation: 20237
Without any extra package:
strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
| sort -rn | head -n 1
Using pdfinfo:
pdfinfo file.pdf | awk '/^Pages:/ {print $2}'
Using pdftk:
pdftk file.pdf dump_data | grep NumberOfPages | awk '{print $2}'
You can also recursively sum the total number of pages in all PDFs via pdfinfo as follows:
find . -xdev -type f -name "*.pdf" -exec pdfinfo "{}" ";" | \
awk '/^Pages:/ {n += $2} END {print n}'
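If a script has to run on machines where not all of these tools are installed, the three approaches can be chained. The following is a sketch only: the function name pdf_page_count is made up, the order of preference is an assumption, and the last branch is the same strings/sed heuristic as above, which can be wrong for compressed or unusual PDFs.

```shell
#!/bin/sh
# Hypothetical helper: print the page count of "$1" using the first tool found.
pdf_page_count() {
    if command -v pdfinfo >/dev/null 2>&1; then
        # pdfinfo prints a line such as "Pages:          5"
        pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ {print $2}'
    elif command -v pdftk >/dev/null 2>&1; then
        # pdftk dump_data prints a line such as "NumberOfPages: 5"
        pdftk "$1" dump_data 2>/dev/null | awk '/^NumberOfPages:/ {print $2}'
    else
        # Last resort: largest /Count value found in the raw file (heuristic).
        strings < "$1" | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
            | sort -rn | head -n 1
    fi
}
```

Usage would be along the lines of num_pgs="$(pdf_page_count file.pdf)".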
Upvotes: 86
Reputation: 348
Building on Marius Hofert's answer, this command uses a bash for loop to show the number of pages followed by the filename, and it ignores the case of the file extension.
for f in *.[pP][dD][fF]; do pdfinfo "$f" | grep Pages | awk '{printf $2 }'; echo " $f"; done
Upvotes: -1
Reputation: 747
mupdf/mutool solution:
mutool info tmp.pdf | grep '^Pages' | cut -d ' ' -f 2
Upvotes: 3
Reputation: 453
If you're on macOS you can query pdf metadata like this:
mdls -name kMDItemNumberOfPages -raw file.pdf
as seen here https://apple.stackexchange.com/questions/225175/get-number-of-pdf-pages-in-terminal
Upvotes: 3
Reputation: 19
I made a small improvement to Marius Hofert's tip to sum the returned values.
for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done | awk '{s+=$1}END{print s}'
Upvotes: 0
Reputation: 6682
Here is a version for the command line directly (based on pdfinfo):
for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done
Upvotes: 8
Reputation: 727
Just dug out an old script (in ksh) I found:
#!/usr/bin/env ksh
# Usage: pdfcount.sh file.pdf
#
# Optimally, this would be a mere:
# pdfinfo file.pdf | grep Pages | sed 's/[^0-9]*//'
[[ "$#" != "1" ]] && {
printf "ERROR: No file specified\n"
exit 1
}
numpages=0
while read line; do
num=${line/*([[:print:]])+(Count )?(-)+({1,4}(\d))*([[:print:]])/\4}
(( num > numpages)) && numpages=$num
done < <(strings "$@" | grep "/Count")
print $numpages
Upvotes: 1
Reputation: 1658
The pdftotext utility converts a PDF file to text format, inserting page breaks (i.e. form-feed characters, $'\f') between the pages:
NAME
pdftotext - Portable Document Format (PDF) to text converter.
SYNOPSIS
pdftotext [options] [PDF-file [text-file]]
DESCRIPTION
Pdftotext converts Portable Document Format (PDF) files to plain text.
Pdftotext reads the PDF file, PDF-file, and writes a text file, text-file. If text-file is
not specified, pdftotext converts file.pdf to file.txt. If text-file is '-', the text is
sent to stdout.
There are many ways to solve your problem; choose one of them:
1) pdftotext + grep:
$ pdftotext file.pdf - | grep -c $'\f'
2) pdftotext + awk (v1):
$ pdftotext file.pdf - | awk 'BEGIN{n=0} {if(index($0,"\f")){n++}} END{print n}'
3) pdftotext + awk (v2):
$ pdftotext sample.pdf - | awk 'BEGIN{ RS="\f" } END{ print NR }'
4) pdftotext + awk (v3):
$ pdftotext sample.pdf - | awk -v RS="\f" 'END{ print NR }'
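One caveat with the line-based variants: grep -c counts lines that contain a form feed, so two page breaks with no newline between them (which a page without any text may produce) would be counted once. Below is a sketch that counts the form-feed characters themselves, assuming pdftotext emits one per page; the function name count_form_feeds is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count form-feed characters on stdin.
count_form_feeds() {
    # Delete everything except form feeds, count the remaining bytes,
    # and strip the padding some wc implementations add.
    tr -cd '\f' | wc -c | tr -d '[:space:]'
}

# Demonstration on synthetic text: three "pages", the last one empty; prints 3.
printf 'page one\n\fpage two\n\f\f' | count_form_feeds
echo
```

With a real file this would be used as pdftotext file.pdf - | count_form_feeds. On the synthetic input above, a line-based count sees only two lines containing a form feed.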
Hope it Helps!
Upvotes: 8
Reputation: 166
The ImageMagick library provides a tool called identify which, in conjunction with counting the lines of output, gets you what you are after. ImageMagick is an easy install on macOS with brew.
Here is a functional bash script that captures it to a shell variable and dumps it back to the screen...
#!/bin/bash
pdfFile=$1
echo "Processing $pdfFile"
numberOfPages=$(/usr/local/bin/identify "$pdfFile" 2>/dev/null | wc -l | tr -d ' ')
#Identify gets info for each page, dump stderr to dev null
#count the lines of output
#trim the whitespace from the wc -l output
echo "The number of pages is: $numberOfPages"
And the output of running it...
$ ./countPages.sh aSampleFile.pdf
Processing aSampleFile.pdf
The number of pages is: 2
$
Upvotes: 9