Reputation: 9092
I know there is a good open-source project for searching PDFs on iOS: PDFKitten.
However, I have come across some PDF files that it cannot search. When I open these files in the Preview app on the Mac and search there, it works.
I uploaded one file here.
You can check this by opening the file in the Preview app and searching for the word 'ra'. It works perfectly. But if you add the file to the PDFKitten project, configure the project to open it, and then try to search, it doesn't work.
I inspected the source. It handles all the text-showing operators: Tj, ', " and TJ. I added log statements to the callbacks for these operators and saw that the callbacks are never called.
Can you give me any suggestions or ideas?
Upvotes: 2
Views: 1058
Reputation: 95918
If I understand the code correctly, PDFKitten looks for fonts only in the /Font entry of the /Resources dictionary of the page. At least that's my interpretation of the method fontCollectionWithPage of Scanner, whose result is queried by setFont in pdfScannerCallbacks to set the current font object.
Furthermore, there is no callback for the Do operator (i.e. the operator used to inject the contents of an XObject resource into the page content). Unless CGPDFScannerScan interprets this operator under the hood, the content of included XObjects is not scanned at all. This would match your observation that the text-showing operator callbacks never get called.
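To illustrate the effect, here is a minimal sketch in Python using a toy model rather than the real CGPDFScanner API. The content stream is modelled as a list of (operator, operand) pairs, and the names scan, extract, follow_do and Xf1 are hypothetical, chosen only for this example. It shows that a scanner which ignores Do never reaches text-showing operators that live inside a form XObject.

```python
# Toy model only: a content stream is a list of (operator, operand) pairs and
# XObjects are looked up by name. None of this is the CGPDFScanner API.

TEXT_OPERATORS = ("Tj", "TJ", "'", '"')


def scan(stream, xobjects, on_text, follow_do):
    """Walk a content stream and call on_text for each text-showing operator."""
    for operator, operand in stream:
        if operator in TEXT_OPERATORS:
            on_text(operand)
        elif operator == "Do" and follow_do:
            # Do paints a named XObject; a form XObject carries its own
            # content stream, which has to be scanned recursively.
            form = xobjects.get(operand)
            if form is not None:
                scan(form["stream"], form.get("xobjects", {}), on_text, follow_do)


# Hypothetical page: all text sits inside one form XObject named Xf1.
page_stream = [("Do", "Xf1")]
page_xobjects = {"Xf1": {"stream": [("Tj", "mundo")], "xobjects": {}}}


def extract(follow_do):
    """Collect the operands of all text-showing operators that get visited."""
    found = []
    scan(page_stream, page_xobjects, found.append, follow_do)
    return found


without_do = extract(follow_do=False)  # scanner ignores Do
with_do = extract(follow_do=True)      # scanner descends into the XObject
```

In this model, without_do stays empty while with_do contains the text, which is the same symptom as callbacks that are never invoked.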
Your file mundo1.pdf, though, does not have any immediate /Font entry in the /Resources dictionaries of its pages. Instead, all the actual content of each page is wrapped in a single /XObject resource. Each of these XObjects in turn has its own /Resources dictionary, which contains a /Font entry defining the fonts used on the respective page.
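The nesting can be pictured with plain dictionaries. The following Python sketch is only a model of the structure described here, not PDFKitten or Core Graphics code; collect_fonts and the entries Xf1, F1 and Helvetica are made up for illustration. It contrasts a page-level-only font lookup with one that also descends into the /Resources of each XObject.

```python
def collect_fonts(resources, seen=None):
    """Gather /Font entries from a resources dict and from nested form XObjects."""
    if seen is None:
        seen = set()
    fonts = dict(resources.get("Font", {}))
    for xobject in resources.get("XObject", {}).values():
        if id(xobject) in seen:
            continue  # guard against XObjects that reference each other
        seen.add(id(xobject))
        nested = xobject.get("Resources", {})
        for name, font in collect_fonts(nested, seen).items():
            # Keep the outer definition if the same name appears twice.
            fonts.setdefault(name, font)
    return fonts


# Hypothetical page shaped like the description: no /Font at page level,
# one XObject whose own /Resources define the font.
page_resources = {
    "XObject": {
        "Xf1": {"Resources": {"Font": {"F1": "Helvetica"}}},
    },
}

page_level_only = dict(page_resources.get("Font", {}))  # finds nothing
all_fonts = collect_fonts(page_resources)               # finds F1
```

Note that font names are scoped to the resource dictionary they appear in, so flattening them into one dictionary is a simplification. A real implementation would more likely keep a separate font collection per XObject and switch to it while scanning that XObject's content.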
Thus, PDFKitten does not know anything about the fonts used in your file, in particular their encodings, and so cannot extract the text from the PDF contents. It may not even get to see the PDF contents it would need to interpret.
I would therefore suggest that you report this on the PDFKitten issue tracker.
By the way, this PDF construct fully conforms to the PDF specification. Nonetheless, it looks like an inappropriate use of the iText library. The author of the software that uses iText this way should review the code and switch to better-suited classes of the iText library.
Upvotes: 1