Extract text from pdf file based on regular expression?

Question

I have a PDF file with 300 pages, and each set of pages contains identifying information for a person, such as a Social Security number.

Let's say that pages 1-4 are for the Social Security number 987-65-4320 and pages 5-6 are for 987-65-4321.

I want to extract all the information for the first employee, starting from the position of the first Social Security number and ending at the position of the second, and then save it in a new PDF file.

All the examples I saw were about extracting all the text from a PDF file, not extracting based on specific criteria like this one:

extract text from pdf files

Please advise how to accomplish that.

Carl Walsh · Accepted Answer

This isn't an automated technique, but can you get the text out (I might just copy-paste the PDF's contents into a text file) and use a regular expression to find the information you want?

In Java, some of the parsing could look like:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SsnRegexDemo {
    public static void main(String[] args) {
        // Matches 3 digits, a dash, 2 digits, a dash, and 4 digits, and then the text
        // that follows, stopping one character before the next SSN.
        String text = "987-65-4320 some info 987-65-4321 other \ninfo";
        Pattern p = Pattern.compile("(\\d{3}-\\d{2}-\\d{4})((?:.(?!\\d{3}-\\d{2}-\\d{4}))*)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        while (m.find()) {
            System.out.println(m.group(1) + ": " + m.group(2));
        }
    }
}

but without seeing the information you want to save I couldn't help you with getting it.
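If the text is long (300 pages' worth), repeating a per-character lookahead like the one in the pattern above can get slow, and in Java a long repeated group may run into deep recursion. A variation I'd consider is to match only the SSNs and cut the text at the match positions. This is just a sketch under my own assumptions: the class and method names are made up, and I'm assuming that text for the same SSN should be concatenated if that SSN shows up more than once.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from any library.
class SsnSections {
    private static final Pattern SSN = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    // Maps each SSN to the text between it and the next SSN (or the end of the input).
    // Text before the first SSN is ignored.
    static Map<String, String> split(String text) {
        Map<String, String> sections = new LinkedHashMap<>();
        Matcher m = SSN.matcher(text);
        String currentSsn = null;
        int bodyStart = 0;
        while (m.find()) {
            if (currentSsn != null) {
                // Close the previous section at the start of this SSN.
                sections.merge(currentSsn, text.substring(bodyStart, m.start()), String::concat);
            }
            currentSsn = m.group();
            bodyStart = m.end();
        }
        if (currentSsn != null) {
            // The last section runs to the end of the text.
            sections.merge(currentSsn, text.substring(bodyStart), String::concat);
        }
        return sections;
    }

    public static void main(String[] args) {
        String text = "987-65-4320 some info 987-65-4321 other \ninfo";
        split(text).forEach((ssn, body) -> System.out.println(ssn + ":" + body));
    }
}
```

Unlike the lookahead version, this keeps the whitespace right before the next SSN, so you may want to trim() each section before using it.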

If I wanted a new PDF I would put the information into Microsoft Word or Google Docs and save a PDF.

Alternatively, if all you want is to "extract all the information" for a range of employees, would it work to create a copy of the original PDF with some pages removed? I've seen websites that let you do that, but Chrome's print dialog (Chrome can open local PDFs without a problem) also lets you specify a range of pages and save the result as a PDF.
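To know which page range to keep for each person, you could work that out from the text of each page. Here is a rough sketch of that idea, assuming you already have one string per page from whatever PDF tool or library you use (getting those strings is not shown). The class and method names are hypothetical, and I'm assuming that a page without any SSN belongs to the same person as the page before it, and that only the first SSN on a page matters.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from any library.
class SsnPageRanges {
    private static final Pattern SSN = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    // pageTexts.get(0) is the text of page 1, and so on.
    // Returns SSN -> {firstPage, lastPage}, 1-based and inclusive.
    static Map<String, int[]> pageRanges(List<String> pageTexts) {
        Map<String, int[]> ranges = new LinkedHashMap<>();
        String current = null;
        for (int i = 0; i < pageTexts.size(); i++) {
            int page = i + 1;
            Matcher m = SSN.matcher(pageTexts.get(i));
            if (m.find()) {
                current = m.group(); // first SSN on this page
            }
            if (current == null) {
                continue; // pages before the first SSN belong to nobody
            }
            int[] range = ranges.get(current);
            if (range == null) {
                ranges.put(current, new int[] {page, page});
            } else {
                range[1] = page; // extend the existing range
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
                "SSN: 987-65-4320 page one", "page two", "page three", "page four",
                "SSN: 987-65-4321 page five", "page six");
        pageRanges(pages).forEach((ssn, r) ->
                System.out.println(ssn + ": pages " + r[0] + "-" + r[1]));
    }
}
```

The resulting ranges are what you would type into the print dialog, or pass to a page-splitting tool. Note that if the same SSN shows up again later in the file, this sketch stretches its range across everything in between.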
