Extract text from PDF in code

Question

I'm making an app for my school that people can use to check whether they have a schedule change. All schedule changes are listed here: http://www.augustinianum.eu/roosterwijzigingen/14062012.pdf. I want to search that page for a keyword (the user's group, which is entered in an EditText). I've found out how to make the app check whether the EditText matches a certain string, so now I only need to download all of the text on that page into a string. The problem is that it's not a simple web page but a PDF. I've heard that you need a special PDF library to extract the text from the PDF, put that text into a string, and then search the string for keywords using contains(). However, I've got some questions about that:

1) This PDF was made with a PDF creator; it's not a scanned page. You can select the text or search it for keywords using CTRL+F. So I wonder whether it is really necessary to extract the text from the PDF, or whether there is an easier way.
2) I want the app to check for changes every hour, let's say. So it also has to download the PDF (about 8 pages) and extract the text every hour. Would that use a lot of battery?
3) I've heard that there are many libraries that do what I want. Which one should I use? (If possible, I'd like one that is free :))
4) Could anyone explain how to use it in my code? (I'm not really experienced, so please keep it a little easy :))

THANK YOU ALL SO MUCH!!!

Ruben Kazumov · Accepted Answer

Unfortunately, I have not worked with Java, so you will have to implement this in Java yourself. Here is how I finally did it in PHP:

1) I fetched the file from your link. In PHP this is done with @fopen("http://...").

2) I opened it as binary (this is important) and extracted two parts:

2.1) The 3 0 obj part, which holds the creation and modification dates. I did it with a regex. It was simple, and I mentioned it above.

2.2) The data stream from 5 0 obj, which holds the deflated data. IMPORTANT! Microsoft Excel inserts the two bytes 0D 0A as a line break. Do not forget this when you filter the content with a regexp. These bytes at the start and at the end must not be included in the extracted string.

3) I inflated the compressed data with $uncompressed = @gzuncompress($compressed) and wrote the result to an external file. You can see the results there.

4) The funniest part: the raw data inside the file is in textual format. It looks like [(V)-4(RI)16(J)] TJ, which means VRIJ. The numbers between the parenthesised strings are spacing adjustments, not text. You can read about text in PDF in the PDF Reference v1.7, chapter 5 (Text).

5) I believe regular expressions can help you extract and/or transform the data.
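Since the question asks for Java, steps 2 to 4 above could be sketched roughly as follows, using only the standard library (java.util.zip.Inflater and regular expressions). The class and method names (PdfTextSketch, extractStream, inflate, decodeText) are made up for illustration. The sketch assumes the content stream uses plain deflate compression with no other filter or encryption, and that the text is in a simple one-byte encoding, as in this file. It is not a general PDF parser.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class PdfTextSketch {

    // Step 2.2: return the raw bytes after the "stream" keyword of object
    // "<objNum> 0 obj", up to "endstream". Returns null if nothing is found.
    static byte[] extractStream(byte[] pdf, int objNum) {
        // ISO-8859-1 maps each byte to exactly one char, so positions in the
        // string are also valid positions in the byte array.
        String s = new String(pdf, StandardCharsets.ISO_8859_1);
        // The lookbehind keeps "5 0 obj" from matching inside "15 0 obj".
        Matcher m = Pattern.compile("(?<![0-9])" + objNum + " 0 obj").matcher(s);
        if (!m.find()) return null;
        int start = s.indexOf("stream", m.end());
        if (start < 0) return null;
        start += "stream".length();
        // Skip the line break after the keyword (0D 0A, or a lone 0A).
        if (s.startsWith("\r\n", start)) start += 2;
        else if (s.startsWith("\n", start)) start += 1;
        int end = s.indexOf("endstream", start);
        if (end < 0) return null;
        // The line break before "endstream" is left in place on purpose:
        // Inflater stops at the end of the zlib data and ignores what follows.
        return Arrays.copyOfRange(pdf, start, end);
    }

    // Step 3: inflate zlib-wrapped deflate data (the Java counterpart of gzuncompress).
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input: return what was decoded so far
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // One text array followed by the TJ operator, e.g. [(V)-4(RI)16(J)] TJ
    private static final Pattern TJ = Pattern.compile("\\[(.*?)\\]\\s*TJ");
    // One parenthesised string inside such an array (backslash escapes allowed).
    private static final Pattern STR = Pattern.compile("\\(((?:\\\\.|[^\\\\)])*)\\)");

    // Step 4: join the strings of every TJ array, one output line per array.
    // The numbers between the strings are skipped.
    static String decodeText(String content) {
        StringBuilder sb = new StringBuilder();
        Matcher arrays = TJ.matcher(content);
        while (arrays.find()) {
            Matcher strings = STR.matcher(arrays.group(1));
            while (strings.find()) {
                // Undo the escapes \( \) and \\ ; other escapes are left as-is.
                sb.append(strings.group(1).replaceAll("\\\\([()\\\\])", "$1"));
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

With the downloaded file in a byte array (called pdfBytes here, a hypothetical name), the three calls chain together: extractStream(pdfBytes, 5), then inflate(...), then decodeText(new String(..., StandardCharsets.ISO_8859_1)). The resulting string can be searched with contains(). Only text drawn with the TJ operator is picked up; the plain Tj operator and strings containing nested parentheses are not handled.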

IMPORTANT: I said "data stream from 5 0 obj", but the object number is subject to change. You must follow the reference to the object through the dictionary -> pages -> page -> content chain. A description of these "bread crumbs" can be found in the manual I mentioned above.
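As a shortcut for the lookup described above, a page dictionary normally points at its content stream with an entry such as /Contents 5 0 R, so the object number can often be found with one regular expression instead of being hard-coded. The Java sketch below does only that. The names ContentsRefSketch and firstContentsObject are made up for illustration, and the code assumes the page dictionaries are stored uncompressed and that /Contents is a single reference rather than an array. It takes the first match in file order, which is not guaranteed to be page 1, and it does not walk the full chain.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentsRefSketch {

    // An indirect reference in a page dictionary, e.g. "/Contents 5 0 R".
    private static final Pattern CONTENTS =
            Pattern.compile("/Contents\\s+(\\d+)\\s+(\\d+)\\s+R");

    // Returns the object number of the first /Contents reference found in
    // the file, or -1 if there is none in the supported form.
    static int firstContentsObject(byte[] pdf) {
        // ISO-8859-1 keeps binary bytes intact while we search as text.
        String s = new String(pdf, StandardCharsets.ISO_8859_1);
        Matcher m = CONTENTS.matcher(s);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }
}
```

A file with several pages has one such entry per page, so for the 8-page schedule the search would have to be repeated with m.find() to collect every content object.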

Unfortunately, Excel does not embed any table structure in the PDF, but you can find the coordinates of the text portions and interpret them. Either way, it is a mess.

Do you think, dear Merlin, that it is hard? No, dear, it is not. It is not hard, because there are no Unicode symbols. Unicode in PDF is THE REAL SUCK!

Good luck!
