Reputation: 459
I have a PDF file with multiple pages, but I am interested in only a subset of them. For example, my original PDF has 30 pages and I want only pages 10 to 16.
I tried the split_pdf function from the tabulizer package, which only splits the PDF page by page (resulting in 200 files, one for each page), followed by merge_pdfs (which merges PDF files). It works properly, but it takes ages (and I have around 2000 PDF files to split).
This is the code I am using:
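library(tabulizer)
# split_pdf writes one file per page and returns their paths
pages <- split_pdf('file_path')
# named first/last to avoid masking base R's split() and end()
first <- 10
last <- 16
merge_pdfs(pages[first:last], 'saving_path')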
I couldn't find any better option to do this. Any help would be appreciated.
Upvotes: 8
Views: 7331
Reputation: 103
As a complement to G.Grothendieck's answer, one could also use the package staplr, which is an R wrapper around the program pdftk:
library('staplr')
staplr::select_pages(
  selpages = 10:16,
  input_filepath = 'file_path',
  output_filepath = 'saving_path')
In my experience, plain pdftk works faster. But if you need to do something complex and you are more familiar with R syntax than with bash syntax, using the staplr package will save on coding time.
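Since the question mentions around 2000 files, here is a minimal sketch of applying the same select_pages call to every PDF in a folder. The helper names subset_path and subset_folder and the folder arguments are hypothetical, and the sketch assumes every file has at least 16 pages and that staplr and pdftk are installed.

```r
# Hypothetical helper: put the output next to its original name inside out_dir.
subset_path <- function(input, out_dir) {
  file.path(out_dir, basename(input))
}

# Hypothetical helper: keep only `pages` from every PDF found in in_dir.
subset_folder <- function(in_dir, out_dir, pages = 10:16) {
  inputs <- list.files(in_dir, pattern = "\\.pdf$",
                       full.names = TRUE, ignore.case = TRUE)
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (input in inputs) {
    # Same call as above, once per file (requires staplr and pdftk).
    staplr::select_pages(
      selpages = pages,
      input_filepath = input,
      output_filepath = subset_path(input, out_dir))
  }
  invisible(inputs)
}
```

A call such as subset_folder('pdfs', 'pdfs_subset') would then write one reduced file per input, leaving the originals untouched.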
Upvotes: 4
Reputation: 12410
Unfortunately, it is a bit unclear to me what kind of data is in your PDF and what you are trying to extract from it, so I will outline two approaches.
If you have tables in the PDF, you should be able to extract the data from those pages using:
tab <- tabulizer::extract_tables(file = "path/file.pdf", pages = 10:16)
If you only want the text, you should use pdftools, which is a lot faster:
text <- pdftools::pdf_text("path/file.pdf")[10:16]
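When many files of differing lengths are processed, indexing past the last page yields NA entries. A small sketch of guarding against that follows; pages_in_range and read_pages are hypothetical helper names, not part of pdftools.

```r
# Hypothetical helper: keep only page numbers that exist in the document.
pages_in_range <- function(pages, n_pages) {
  pages[pages >= 1 & pages <= n_pages]
}

# Hypothetical helper: text of the requested pages, skipping missing ones.
read_pages <- function(path, pages = 10:16) {
  text <- pdftools::pdf_text(path)  # one character string per page
  text[pages_in_range(pages, length(text))]
}
```

For a 12-page file, read_pages would return the text of pages 10 to 12 only, rather than padding the result with NA.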
Upvotes: 5
Reputation: 269481
Install pdftk (if you don't already have it). Assuming it is on your path and myfile.pdf is in the current directory, run this from R:
system("pdftk myfile.pdf cat 10-16 output myfile_10to16.pdf")
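If file names may contain spaces, or you want to notice failures, the same pdftk call can be made through system2 with quoted arguments. This is only a sketch: pdftk_args and extract_pages are hypothetical helper names, and pdftk is still assumed to be on your path.

```r
# Hypothetical helper: build the argument vector for
# `pdftk <input> cat <first>-<last> output <output>`.
pdftk_args <- function(input, output, first, last) {
  c(shQuote(input), "cat", paste0(first, "-", last),
    "output", shQuote(output))
}

# Hypothetical helper: run pdftk and stop if it reports a non-zero exit status.
extract_pages <- function(input, output, first = 10, last = 16) {
  status <- system2("pdftk", pdftk_args(input, output, first, last))
  if (status != 0) {
    stop("pdftk failed for ", input, " (exit status ", status, ")")
  }
  invisible(output)
}
```

For example, extract_pages("myfile.pdf", "myfile_10to16.pdf") is intended to produce the same file as the one-liner above, and the function can be called in a loop over the paths returned by list.files.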
Upvotes: 2