Reputation: 33625
I have a 300-page PDF file in which each set of pages contains identifying information for one person, such as a Social Security number.
For example, pages 1-4 are for the number 987-65-4320 and pages 5-6 are for 987-65-4321.
I want to extract all the information for the first employee, from the position of the first Social Security number up to the position of the second, and save it in a new PDF file.
All the examples I have seen extract all the text from a PDF file rather than selecting content based on a specific criterion like this one.
Please advise how to accomplish that.
Upvotes: 1
Views: 3588
Reputation: 7009
This isn't an automated technique, but could you get the text out of the PDF (I might just copy and paste it into a text file) and then use a regular expression to find the information you want?
In Java, some of the parsing could look like:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches an SSN (3 digits, a dash, 2 digits, a dash, 4 digits) in group 1,
// then captures in group 2 all the text up to the next SSN (or the end of input).
String text = "987-65-4320 some info 987-65-4321 other \ninfo";
Pattern p = Pattern.compile("(\\d{3}-\\d{2}-\\d{4})((?:.(?!\\d{3}-\\d{2}-\\d{4}))*)", Pattern.DOTALL);
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println(m.group(1) + ": " + m.group(2));
}
Without seeing the information you want to save, though, I can't help you with extracting it.
If I wanted a new PDF I would put the information into Microsoft Word or Google Docs and save a PDF.
Alternatively, if all you want is to "extract all the information" for a range of employees, would it work to create a copy of the original PDF with some pages removed? I've seen websites that let you do that, but Chrome's print dialog (Chrome can open local PDFs without a problem) also lets you specify a range of pages and save the result as a PDF.
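If you would rather work out the page ranges in code than by eye, here is a minimal sketch of one way to do it. It assumes you already have the text of each page as a separate string (a PDF library such as Apache PDFBox can provide that, but that step is not shown), and it assumes a page without an SSN belongs to the most recent person seen. The class name `SsnPageRanges` and the method `findRanges` are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SsnPageRanges {
    // Same SSN shape as the regular expression above: 3 digits, 2 digits, 4 digits.
    private static final Pattern SSN = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    /**
     * pageTexts.get(i) holds the text of page i + 1.
     * Returns SSN -> {firstPage, lastPage} (1-based, inclusive), in order of first appearance.
     */
    public static Map<String, int[]> findRanges(List<String> pageTexts) {
        Map<String, int[]> ranges = new LinkedHashMap<>();
        String current = null;
        for (int i = 0; i < pageTexts.size(); i++) {
            int pageNumber = i + 1;
            Matcher m = SSN.matcher(pageTexts.get(i));
            if (m.find()) {
                current = m.group(); // first SSN on the page decides the owner
            }
            if (current == null) {
                continue; // pages before the first SSN belong to nobody
            }
            int[] range = ranges.get(current);
            if (range == null) {
                ranges.put(current, new int[] {pageNumber, pageNumber});
            } else {
                range[1] = pageNumber; // extend the range to this page
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Stand-in page texts; real ones would come from the PDF.
        List<String> pages = List.of(
                "SSN: 987-65-4320 first page", "continued", "continued", "continued",
                "SSN: 987-65-4321 first page", "continued");
        findRanges(pages).forEach((ssn, r) ->
                System.out.println(ssn + ": pages " + r[0] + "-" + r[1]));
    }
}
```

Each resulting range is what you would type into the print dialog, or hand to a PDF library to copy those pages into a new file. If the same SSN could show up again later in the document, this sketch would stretch its range across the pages in between, so it only fits documents where each person's pages are contiguous.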
Upvotes: 1