Reputation: 2877
I have imported a large PDF into R and need to read only certain lines from it. The file was imported using pdftools,
and the resulting object is a character vector with 10353 elements (so nrow() returns NULL).
nrow(PDF)
NULL
class(PDF)
[1] "character"
str(PDF)
chr [1:10353] "Itemized Statement For:" "Patient Name: SMITH ,JOHN" "POLICY ID: 000000000" ...
I need to read in the following lines:
PDF.clean <- PDF[c(7:38,47:78,87:118............)]
That is, the lines I want are 7:38, and the same range repeats every 40 lines (47:78, 87:118, and so on) until the end of the document is reached.
Is there a smart way to set initial seeds such as x = 7 and y = 38, keep adding 40 to each while the values don't exceed 10353, and build up the subset this way?
Upvotes: 0
Views: 39
Reputation: 389335
You can create a sequence from 0 to end with a step of 40 and add each value to 7:38 to get all the indices that you want to extract. Then remove the indices that are greater than end.
end <- 10353
inds <- c(sapply(seq(0, end, 40), `+`, 7:38))
inds <- inds[inds <= end]
head(inds, 35)
# [1] 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
#[25] 31 32 33 34 35 36 37 38 47 48 49
tail(inds, 35)
# [1] 10311 10312 10313 10314 10315 10316 10317 10318 10327 10328 10329 10330
#[13] 10331 10332 10333 10334 10335 10336 10337 10338 10339 10340 10341 10342
#[25] 10343 10344 10345 10346 10347 10348 10349 10350 10351 10352 10353
You can use this to subset PDF. Since PDF is a character vector (it has no rows or columns), index it without a trailing comma:
PDF.clean <- PDF[inds]
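If the seeds (7 and 38) and the step (40) need to be adjustable, the same idea can be wrapped in a small function. The following is only a sketch: block_indices and its argument names are hypothetical, not part of pdftools or the code above, and it assumes n is at least as large as first. It also cross-checks the result against an equivalent logical mask built with the modulo operator.

```r
# Hypothetical helper (illustrative only): indices of the block first:last,
# repeated every `step` positions, capped at n. Assumes n >= first.
block_indices <- function(first, last, step, n) {
  starts <- seq(0, n - first, by = step)       # offset of each repeated block
  inds <- c(outer(first:last, starts, `+`))    # one column per block, flattened in order
  as.integer(inds[inds <= n])                  # drop indices past the end
}

inds <- block_indices(first = 7, last = 38, step = 40, n = 10353)

# Equivalent logical mask: a position is kept when its offset from the
# first wanted line, modulo the step, falls inside the block length.
pos  <- seq_len(10353)
mask <- (pos - 7) %% 40 < (38 - 7 + 1)

identical(inds, which(mask))  # expected to be TRUE
```

Either inds or mask can be passed to [ on the character vector. The mask version avoids building and trimming the index vector, at the cost of being a little less obvious to read.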
Upvotes: 1