Reputation: 7303
The file srcfile.pdf has a variable number of Roman-numbered pages (i, ii, iii, etc.) followed by Arabic-numbered pages (1, 2, 3, ..., n).
How can I extract only the Arabic-numbered pages (e.g. #1 to #10)?
The following command extracts pages i, ii, iii, 1, 2, etc.:
qpdf --empty --pages srcfile.pdf 1-10 -- targetfile.pdf
Is it possible to extract only pages 1, 2, 3, etc.?
Upvotes: 0
Views: 66
Reputation: 7303
qpdf has an option --json to generate a JSON representation of the file.
With this option there is a workaround using a JSON parser such as jq.
The following bash script converts a "relative" page number within a pagelabel into an absolute page number:
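cat <<'EOF' | tee relpage.sh
#!/bin/bash
# Usage: relpage.sh FILE LABEL:PAGE
# Prints the absolute (1-based) page number of PAGE within pagelabel LABEL (0-based).
# Assumes the pagelabel starts numbering at 1.
PDFFILE=$1
STR=$2
PAGELABELNR=${STR%:*}
PAGENRINLABEL=${STR#*:}
# "index" is the 0-based position of the first page of the pagelabel
PAGELABELINDEX=$(qpdf --json "$PDFFILE" | jq -r ".pagelabels[$PAGELABELNR].index")
ABSPAGENR=$((PAGELABELINDEX + PAGENRINLABEL))
echo "$ABSPAGENR"
EOF
chmod +x relpage.sh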
Usage: ./relpage.sh inputfile.pdf 1:17
Note that pagelabels are 0-based.
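The arithmetic behind the script can be checked without a PDF. The following is only a sketch: abs_page is a hypothetical helper, not part of qpdf, and its optional third argument (the number the pagelabel starts at, i.e. the /St entry of the label dictionary, taken as 1 when omitted) is an assumption that relpage.sh itself does not handle.

```shell
#!/bin/bash
# abs_page INDEX PAGE [START]   (hypothetical helper)
#   INDEX: 0-based index of the first page of the pagelabel ("index" in qpdf's JSON)
#   PAGE:  page number as printed within that pagelabel
#   START: number the pagelabel starts counting at (assumed 1 when omitted)
abs_page() {
  local index=$1 page=$2 start=${3:-1}
  # qpdf page ranges are 1-based, hence the +1 on top of the 0-based index
  echo $(( index + page - start + 1 ))
}

abs_page 4 17      # pagelabel starting at index 4, numbering from 1: prints 21
abs_page 4 17 5    # same pagelabel, but numbering starts at 5: prints 17
```

With a start value of 1 this reduces to index + page, which is what relpage.sh computes.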
To extract pages 17 to 39 of pagelabel 1, use the following command:
qpdf \
--empty \
--pages \
"${INPUTFILE}" \
"$(./relpage.sh "${INPUTFILE}" 1:17)-$(./relpage.sh "${INPUTFILE}" 1:39)" \
-- \
output.pdf
To get the pagelabels info, use qpdf --json --json-key=pagelabels inputfile.pdf
or the following:
$ INPUTFILE=inputfile.pdf
$ PAGELABELSLENGTH=$(qpdf --json "${INPUTFILE}" | jq -r '.pagelabels | length')
$ echo "file '${INPUTFILE}' has ${PAGELABELSLENGTH} pagelabels"
$ for i in $(seq 0 $((PAGELABELSLENGTH-1))); \
do echo "pagelabel #$i starts at index #$(qpdf --json "${INPUTFILE}" | jq -r ".pagelabels[$i].index")"; \
done
file 'inputfile.pdf' has 2 pagelabels
pagelabel #0 starts at index #0
pagelabel #1 starts at index #4
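The start indices listed above are enough to work out which absolute pages each pagelabel covers. A minimal sketch, assuming the total page count is known (qpdf --show-npages should report it); label_spans is a hypothetical helper and the total of 30 pages is made up for illustration.

```shell
#!/bin/bash
# label_spans TOTAL INDEX...   (hypothetical helper)
#   TOTAL: total number of pages in the file
#   INDEX...: 0-based start index of each pagelabel, in order
# Prints the absolute (1-based) page span of every pagelabel.
label_spans() {
  local total=$1; shift
  local -a idx=("$@")
  local i first last
  for i in "${!idx[@]}"; do
    first=$(( idx[i] + 1 ))
    if (( i + 1 < ${#idx[@]} )); then
      # a pagelabel ends right before the next one starts
      last=${idx[i+1]}
    else
      # the last pagelabel runs to the end of the file
      last=$total
    fi
    echo "pagelabel #$i: pages $first-$last"
  done
}

label_spans 30 0 4
# pagelabel #0: pages 1-4
# pagelabel #1: pages 5-30
```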
Upvotes: 0