Reputation: 7460
Many hours have I searched for a fast and easy, but mostly accurate, way to get the number of pages in a PDF document. Since I work for a graphic printing and reproduction company that works a lot with PDFs, the number of pages in a document must be precisely known before they are processed. PDF documents come from many different clients, so they aren't generated with the same application and/or don't use the same compression method.
Here are some of the answers I found insufficient or simply NOT working:
Imagick requires a lot of installation, Apache needs to be restarted, and when I finally had it working, it took amazingly long to process (2-3 minutes per document) and it always returned 1 page for every document (I haven't seen a working copy of Imagick so far), so I threw it away. That was with both the getNumberImages() and identifyImage() methods.
FPDI is easy to use and install (just extract files and call a PHP script), BUT many of the compression techniques are not supported by FPDI. It then returns an error:
FPDF error: This document (test_1.pdf) probably uses a compression technique which is not supported by the free parser shipped with FPDI.
This opens the PDF file in a stream and searches for some kind of string, containing the pagecount or something similar.
$f = "test1.pdf";
$stream = fopen($f, "r");
$content = fread ($stream, filesize($f));
if(!$stream || !$content)
return 0;
$count = 0;
// Regular Expressions found by Googling (all linked to SO answers):
$regex = "/\/Count\s+(\d+)/";
$regex2 = "/\/Page\W*(\d+)/";
$regex3 = "/\/N\s+(\d+)/";
if(preg_match_all($regex, $content, $matches))
$count = max($matches[1]); // $matches[1] holds the captured numbers
return $count;
/\/Count\s+(\d+)/ (looks for /Count <number>) doesn't work because only a few documents have the parameter /Count inside, so most of the time it doesn't return anything. Source.
/\/Page\W*(\d+)/ (looks for /Page<number>) doesn't get the number of pages; it mostly matches some other data. Source.
/\/N\s+(\d+)/ (looks for /N <number>) doesn't work either, as the documents can contain multiple values of /N; most, if not all, of them do not contain the pagecount. Source.
So, what does work reliably and accurately?
Upvotes: 82
Views: 135235
Reputation: 53182
This works fine in ImageMagick.
convert image.pdf -format "%n\n" info: | head -n 1
Upvotes: 1
Reputation: 143
I had problems installing ImageMagick on a production server. After hours of attempts, I decided to get rid of IM and found another approach:
Install poppler-utils:
$ sudo apt install poppler-utils [On Debian/Ubuntu & Mint]
$ sudo dnf install poppler-utils [On RHEL/CentOS & Fedora]
$ sudo zypper install poppler-tools [On OpenSUSE]
$ sudo pacman -S poppler [On Arch Linux]
Then execute via shell in your programming language (e.g. PHP):
shell_exec("pdfinfo " . escapeshellarg($filePath) . " | grep Pages | cut -f 2 -d':' | xargs");
Upvotes: 3
Reputation: 7460
What works is a simple command-line tool, pdfinfo. It is downloadable for Linux and Windows. You download a compressed file containing several little PDF-related programs. Extract it somewhere.
One of those files is pdfinfo (or pdfinfo.exe for Windows). An example of data returned by running it on a PDF document:
Title: test1.pdf
Author: John Smith
Creator: PScript5.dll Version 5.2.2
Producer: Acrobat Distiller 9.2.0 (Windows)
CreationDate: 01/09/13 19:46:57
ModDate: 01/09/13 19:46:57
Tagged: yes
Form: none
Pages: 13 <-- This is what we need
Encrypted: no
Page size: 2384 x 3370 pts (A0)
File size: 17569259 bytes
Optimized: yes
PDF version: 1.6
I haven't seen a PDF document where it returned a false pagecount (yet). It is also really fast: even with big documents of 200+ MB, the response time is just a few seconds or less.
There is an easy way of extracting the pagecount from the output, here in PHP:
// Make a function for convenience
function getPDFPages($document)
{
$cmd = "/path/to/pdfinfo"; // Linux
// $cmd = "C:\\path\\to\\pdfinfo.exe"; // Windows: use this line instead of the one above
// Parse entire output
// Surround with double quotes if file name has spaces
exec("$cmd \"$document\"", $output);
// Iterate through lines
$pagecount = 0;
foreach($output as $op)
{
// Extract the number
if(preg_match("/Pages:\s*(\d+)/i", $op, $matches) === 1)
{
$pagecount = intval($matches[1]);
break;
}
}
return $pagecount;
}
// Use the function
echo getPDFPages("test 1.pdf"); // Output: 13
Of course this command line tool can be used in other languages that can parse output from an external program, but I use it in PHP.
I know it's not pure PHP, but external programs are way better in PDF handling (as seen in the question).
I hope this can help people, because I have spent a whole lot of time trying to find the solution to this and I have seen a lot of questions about PDF pagecount in which I didn't find the answer I was looking for. That's why I made this question and answered it myself.
Security Notice: Use escapeshellarg on $document if the document name is being fed from user input or file uploads.
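As mentioned, pdfinfo can be driven from any language that can capture a program's output. Here is a minimal Python sketch of the same idea. The names parse_page_count and pdf_page_count are made up for illustration, and it assumes pdfinfo is installed and on the PATH. Passing the arguments as a list bypasses the shell, so the file name needs no quoting or escaping.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output: str) -> int:
    """Return the value of the 'Pages:' line in pdfinfo output, or 0 if absent."""
    for line in pdfinfo_output.splitlines():
        # 'Page size:' does not match, because the pattern requires 'Pages:'
        match = re.match(r"Pages:\s*(\d+)", line)
        if match:
            return int(match.group(1))
    return 0


def pdf_page_count(path: str, pdfinfo: str = "pdfinfo") -> int:
    """Run pdfinfo on `path` (assumes the binary is installed) and parse its output."""
    result = subprocess.run(
        [pdfinfo, path],          # list form: no shell is involved
        capture_output=True,
        text=True,
        errors="replace",         # tolerate odd bytes in Title/Author fields
        check=True,               # raise if pdfinfo exits with an error
    )
    return parse_page_count(result.stdout)


if __name__ == "__main__":
    sample = "Title: test1.pdf\nPages: 13\nPage size: 2384 x 3370 pts (A0)\n"
    print(parse_page_count(sample))  # prints 13
```

Splitting the parsing from the process call keeps the parsing part testable without a PDF at hand.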
Upvotes: 115
Reputation: 1
You often read about the regex /\/Page\W/, but it doesn't work for me on several PDF files.
So here is another regular expression that works for me.
$pdf = file_get_contents($path_pdf);
return preg_match_all("/[<|>][\r\n|\r|\n]*\/Type\s*\/Page\W/", $pdf, $dummy);
Upvotes: -1
Reputation: 161
You can use mutool.
mutool show FILE.pdf trailer/Root/Pages/Count
mutool is part of the MuPDF software package.
Upvotes: 1
Reputation: 153
This simple one-liner seems to do the job well:
strings $path_to_pdf | grep Kids | grep -o R | wc -l
There is a block in the PDF file which details the number of pages in this funky string:
/Kids [3 0 R 4 0 R 5 0 R 6 0 R 7 0 R 8 0 R 9 0 R 10 0 R 11 0 R 12 0 R 13 0 R 14 0 R 15 0 R 16 0 R 17 0 R 18 0 R 19 0 R 20 0 R 21 0 R 22 0 R 23 0 R 24 0 R 25 0 R 26 0 R 27 0 R 28 0 R 29 0 R 30 0 R 31 0 R 32 0 R 33 0 R 34 0 R 35 0 R 36 0 R 37 0 R 38 0 R 39 0 R 40 0 R 41 0 R]
The number of 'R' characters is the number of pages
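A stricter variant of the same count can be sketched in Python on the raw bytes. Note the assumption it rests on: the page tree is flat and stored uncompressed. PDF page trees may be nested (an entry in /Kids can point to another /Pages node instead of a page), and newer files may keep these objects inside compressed object streams, so treat the result as a heuristic. The function name count_kids_refs is hypothetical.

```python
import re


def count_kids_refs(raw: bytes) -> int:
    """Count indirect references ('N G R') inside every /Kids array in raw PDF bytes."""
    total = 0
    for kids in re.findall(rb"/Kids\s*\[([^\]]*)\]", raw):
        # Each child is written as '<object number> <generation> R'
        total += len(re.findall(rb"\d+\s+\d+\s+R", kids))
    return total


if __name__ == "__main__":
    flat = b"<< /Type /Pages /Kids [3 0 R 4 0 R 5 0 R] /Count 3 >>"
    print(count_kids_refs(flat))  # prints 3
```

With a nested tree, for example a root with /Kids [2 0 R 3 0 R] where object 2 is itself a /Pages node with /Kids [4 0 R 5 0 R], this counts 4 references although the document has only 3 pages.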
Upvotes: 1
Reputation: 4880
Here is a simple example to get the number of pages in PDF with PHP.
<?php
function count_pdf_pages($pdfname) {
$pdftext = file_get_contents($pdfname);
$num = preg_match_all("/\/Page\W/", $pdftext, $dummy);
return $num;
}
$pdfname = 'example.pdf'; // Put your PDF path
$pages = count_pdf_pages($pdfname);
echo $pages;
?>
Upvotes: 7
Reputation: 749
I created a wrapper class for pdfinfo in case it's useful to anyone, based on Richard's answer:
/**
* Wrapper for pdfinfo program, part of xpdf bundle
* http://www.xpdfreader.com/about.html
*
* this will put all pdfinfo output into keyed array, then make them accessible via getValue
*/
class PDFInfoWrapper {
const PDFINFO_CMD = 'pdfinfo';
/**
* keyed array to hold all the info
*/
protected $info = array();
/**
* raw output in case we need it
*/
public $raw = "";
/**
* Constructor
* @param string $filePath - path to file
*/
public function __construct($filePath) {
exec(self::PDFINFO_CMD . ' ' . escapeshellarg($filePath), $output);
//loop each line and split into key and value
foreach($output as $line) {
$colon = strpos($line, ':');
if($colon) {
$key = trim(substr($line, 0, $colon));
$val = trim(substr($line, $colon + 1));
//use strtolower to make case insensitive
$this->info[strtolower($key)] = $val;
}
}
//store the raw output
$this->raw = implode("\n", $output);
}
/**
* get a value
* @param string $key - key name, case insensitive
* @returns string value
*/
public function getValue($key) {
return @$this->info[strtolower($key)];
}
/**
* list all the keys
* @returns array of key names
*/
public function getAllKeys() {
return array_keys($this->info);
}
}
Upvotes: 2
Reputation: 27486
You can use qpdf as shown below. If the file file_name.pdf has 100 pages:
$ qpdf --show-npages file_name.pdf
100
Upvotes: 11
Reputation: 83437
Since you're ok with using command line utilities, you can use cpdf (Microsoft Windows/Linux/Mac OS X). To obtain the number of pages in one PDF:
cpdf.exe -pages "my file.pdf"
Upvotes: 2
Reputation: 3884
If you have access to a shell, the simplest (but not usable on 100% of PDFs) approach would be to use grep.
This should return just the number of pages:
grep -m 1 -aoP '(?<=\/N )\d+(?=\/)' file.pdf
Example: https://regex101.com/r/BrUTKn/1
Switches description:
-m 1 is necessary as some files can have more than one match of the regex pattern (volunteer needed to replace this with a match-only-first regex solution)
-a is necessary to treat the binary file as text
-o to show only the match
-P to use Perl regular expressions
Regex explanation:
(?<=\/N ) lookbehind of /N (nb. the trailing space character is not visible here)
\d+ one or more digits
(?=\/) lookahead of /
Nota bene: if in some case match is not found, it's safe to assume only 1 page exists.
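Where grep -P is not available, the same lookaround pattern can be tried in Python. This is only a sketch: it assumes the first "/N <digits>/" in the file is the page count, which holds for linearized ("fast web view") PDFs, where /N in the linearization dictionary gives the number of pages. Other files may produce no match or an unrelated one. The helper name first_n_value is made up.

```python
import re
from typing import Optional

# Same pattern as the grep call: digits preceded by "/N " and followed by "/".
N_PATTERN = re.compile(rb"(?<=/N )\d+(?=/)")


def first_n_value(raw: bytes) -> Optional[int]:
    """Return the first '/N <digits>/' value in raw PDF bytes, or None if no match."""
    match = N_PATTERN.search(raw)  # search() stops at the first hit, like -m 1
    return int(match.group()) if match else None


if __name__ == "__main__":
    header = b"<</Linearized 1/L 17569259/O 15/E 4096/N 13/T 17560000>>"
    print(first_n_value(header))  # prints 13
```

For a real file, the bytes could come from pathlib.Path("file.pdf").read_bytes(). Returning None rather than 1 on no match leaves the fallback decision to the caller.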
Upvotes: 0
Reputation: 657
This seems to work pretty well, without the need for special packages or parsing command output.
<?php
$target_pdf = "multi-page-test.pdf";
$cmd = sprintf("identify %s", escapeshellarg($target_pdf));
exec($cmd, $output);
$pages = count($output);
Upvotes: 2
Reputation: 4452
The simplest of all is using ImageMagick. Here is some sample code:
$image = new Imagick();
$image->pingImage('myPdfFile.pdf');
echo $image->getNumberImages();
Otherwise, you can also use PDF libraries like MPDF or TCPDF for PHP.
Upvotes: 33
Reputation: 372
The R package pdftools and its function pdf_info() provide information on the number of pages in a PDF.
library(pdftools)
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
info <- pdf_info(pdf_file)
nbpages <- info[2]
nbpages
$pages
[1] 65
Upvotes: 0
Reputation: 1
Here is a Windows command script using Ghostscript that reports the number of pages in a PDF file:
@echo off
echo.
rem
rem this file: getlastpagenumber.cmd
rem version 0.1 from commander 2015-11-03
rem need Ghostscript e.g. download and install from http://www.ghostscript.com/download/
rem Install path "C:\prg\ghostscript" for using the script without changes \\ and have less problems with UAC
rem
:vars
set __gs__="C:\prg\ghostscript\bin\gswin64c.exe"
set __lastpagenumber__=1
set __pdffile__="%~1"
set __pdffilename__="%~n1"
set __datetime__=%date%%time%
set __datetime__=%__datetime__:.=%
set __datetime__=%__datetime__::=%
set __datetime__=%__datetime__:,=%
set __datetime__=%__datetime__:/=%
set __datetime__=%__datetime__: =%
set __tmpfile__="%tmp%\%~n0_%__datetime__%.tmp"
:check
if %__pdffile__%=="" goto error1
if not exist %__pdffile__% goto error2
if not exist %__gs__% goto error3
:main
%__gs__% -dBATCH -dFirstPage=9999999 -dQUIET -dNODISPLAY -dNOPAUSE -sstdout=%__tmpfile__% %__pdffile__%
FOR /F " tokens=2,3* usebackq delims=:" %%A IN (`findstr /i "number" %__tmpfile__%`) DO set __lastpagenumber__=%%A
set __lastpagenumber__=%__lastpagenumber__: =%
if exist %__tmpfile__% del %__tmpfile__%
:output
echo The PDF-File: %__pdffilename__% contains %__lastpagenumber__% pages
goto end
:error1
echo no pdf file selected
echo usage: %~n0 PDFFILE
goto end
:error2
echo no pdf file found
echo usage: %~n0 PDFFILE
goto end
:error3
echo.can not find the ghostscript bin file
echo. %__gs__%
echo.please download it from:
echo. http://www.ghostscript.com/download/
echo.and install to "C:\prg\ghostscript"
goto end
:end
exit /b
Upvotes: 0
Reputation: 89
Here is an R function that reports the number of pages in a PDF file by using the pdfinfo command.
pdf.file.page.number <- function(fname) {
a <- pipe(paste("pdfinfo", shQuote(fname), "| grep Pages | cut -d: -f2"))
page.number <- as.numeric(readLines(a))
close(a)
page.number
}
if (F) {
pdf.file.page.number("a.pdf")
}
Upvotes: 0
Reputation: 569
If you can't install any additional packages, you can use this simple one-liner:
foundPages=$(strings < $PDF_FILE | sed -n 's|.*Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1)
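A rough Python equivalent of that pipeline, for systems where strings or sed are not available. Like the one-liner, it takes the largest /Count value found in the file. This is a heuristic: it assumes the page tree is stored uncompressed, and /Count also appears in outline (bookmark) dictionaries, so it can be misled. The name max_count_value is hypothetical.

```python
import re


def max_count_value(raw: bytes) -> int:
    """Return the largest number following '/Count' in raw PDF bytes, or 0 if none."""
    # The optional minus sign is skipped, as in the sed pattern, so '-3' counts as 3.
    values = [int(v) for v in re.findall(rb"/Count\s+-?(\d+)", raw)]
    return max(values, default=0)


if __name__ == "__main__":
    raw = b"<< /Type /Pages /Count 2 >> << /Type /Pages /Count 41 >> << /Count -3 >>"
    print(max_count_value(raw))  # prints 41
```

Taking the maximum works because the root /Pages node carries the total, while intermediate nodes carry only the size of their subtree.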
Upvotes: 3