Reputation: 5832
We have hundreds of PDF files on a server. Some of them contain searchable text and others do not.
I was asked to find out which are searchable and which are not.
Does anybody know of a way to read in a batch of PDFs and determine whether each document contains searchable/selectable text, or only non-selectable, non-searchable text that needs to be OCR'd?
I don't even need to read the text itself; I just need to detect, possibly by tags or keywords in the raw data, something that suggests the file contains fonts or similar.
Are there tags in a searchable PDF that make it easy to detect?
Thanks
Upvotes: 3
Views: 3027
Reputation: 10433
You could modify this code (pdf2text) to suit your purposes, I believe. Or this answer might get you to the right spot as well.
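If you only want a rough first pass over the raw bytes, something along these lines might work. This is a minimal sketch, not the pdf2text code mentioned above. It assumes that a PDF with a text layer declares a `/Font` resource somewhere, either in plain sight or inside a Flate-compressed stream, and that a purely scanned PDF does not. The names `looks_searchable` and `classify_folder` are made up for illustration.

```python
import re
import zlib
from pathlib import Path

# Matches the body of every PDF stream object (non-greedy, binary-safe).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
FONT_MARKER = b"/Font"


def looks_searchable(data: bytes) -> bool:
    """Heuristic (an assumption, not a guarantee): True if the PDF
    bytes mention a font resource anywhere."""
    # Uncompressed dictionaries expose /Font directly in the raw bytes.
    if FONT_MARKER in data:
        return True
    # Newer PDFs may keep dictionaries inside Flate-compressed streams,
    # so try to inflate each stream and look again.
    for match in STREAM_RE.finditer(data):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (e.g. a JPEG image); skip it
        if FONT_MARKER in inflated:
            return True
    return False


def classify_folder(folder: str) -> dict[str, bool]:
    """Map each *.pdf file name in `folder` to the heuristic's verdict."""
    return {p.name: looks_searchable(p.read_bytes())
            for p in sorted(Path(folder).glob("*.pdf"))}
```

Calling `classify_folder("/path/to/pdfs")` (path is a placeholder) would give you a dict of file name to True/False. Treat the result as a hint only: encrypted files, or a scanned document that embeds a font just for a stamp or page number, can fool it, so a real PDF parsing library is the safer choice for borderline cases.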
Upvotes: 1