Reputation: 5114
Is it possible to extract the header and/or footer from a PDF document?
I have tried a few options (PDFMiner, the Ruby gem pdf-extract, and studying the PDF format specs), and I'm starting to suspect that the header/footer information is not available at all.
(I would like to do this from Python, if possible, but any other alternative is viable.)
Upvotes: 4
Views: 6039
Reputation: 96064
Page headers and footers are not (at least not necessarily) located in some content part separate from the rest of the page content. Thus, in general there is no way to reliably extract headers and footers from PDFs.
It is possible, though, to try and use heuristics which look at the whole PDF contents and try to guess what parts are headers and/or footers.
If the PDFs you want to analyze are fairly homogeneous, e.g. all produced by the same publisher and looking alike, this might be feasible. The more diverse your source PDFs are, though, the more complex your heuristics will likely become and the less accurate the results will be.
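To illustrate what such a heuristic could look like, here is a minimal Python sketch. It assumes you have already extracted the text of each page as a list of lines in reading order (with PDFMiner or any other tool); the function names `normalize` and `guess_headers_footers` and the parameters `depth` and `min_share` are made up for this example. The idea is simply that a line which shows up at the top or bottom of most pages, ignoring page numbers, is probably a header or footer.

```python
import re
from collections import Counter


def normalize(line):
    # Replace digit runs so that "Page 3" and "Page 4" compare equal.
    return re.sub(r"\d+", "#", line.strip())


def guess_headers_footers(pages, depth=2, min_share=0.6):
    """Guess repeated header/footer lines.

    pages: list of pages, each a list of text lines in reading order
           (assumed to be extracted beforehand).
    depth: how many lines at the top/bottom of a page to inspect (>= 1).
    min_share: fraction of pages a line must appear on to count.
    """
    top, bottom = Counter(), Counter()
    for lines in pages:
        lines = [l for l in lines if l.strip()]  # ignore blank lines
        # Use sets so a line is counted at most once per page.
        top.update({normalize(l) for l in lines[:depth]})
        bottom.update({normalize(l) for l in lines[-depth:]})
    threshold = min_share * len(pages)
    headers = {t for t, n in top.items() if n >= threshold}
    footers = {t for t, n in bottom.items() if n >= threshold}
    return headers, footers


sample = [
    ["ACME Annual Report", "Intro text", "more body", "Page 1"],
    ["ACME Annual Report", "Other body", "still body", "Page 2"],
    ["ACME Annual Report", "Third body", "closing", "Page 3"],
]
headers, footers = guess_headers_footers(sample, depth=1)
# headers == {"ACME Annual Report"}, footers == {"Page #"}
```

This only works for documents with consistent running headers and footers. Alternating left/right page headers, chapter titles that change, or very short pages would each need additional rules, which is exactly where the heuristics start to grow.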
Upvotes: 8