Reputation: 64
I am Working on Pdfs to Excel conversion using docparser. But docparser is unable to process scanned pdfs properly. So I need to seperate the scanned pdfs from normal pdfs and only want to process normal pdfs through docparser(i.e API call). Is there exit some to way to identify the pdf type(Scanned or normal) programmatically so that I could work further? Please help if anyone knows how to tackle this problem.....
Upvotes: 1
Views: 3266
Reputation: 64
Finally, I found a solution to my question.But not a standard one(I THINK SO). Thanks to the people who commented and provide some help.
Using Pdfbox library we can extract pages of scanned pdf and will compare each page to the instance of an image object(PDImageXObject),if it comes true , the page will be count as an image and we can count those images.If images are equal to number of pages in pdf. We will say it is a scanned pdf.
here is the code...
public static String testPdf(String filename) throws IOException
{
String s = "";
int g = 0;
int gg = 0;
PDDocument doc = PDDocument.load(new File(filename));
gg = doc.getNumberOfPages();
for(PDPage page:doc.getPages())
{
PDResources resource = page.getResources();
for(COSName xObjectName:resource.getXObjectNames())
{
PDXObject xObject = resource.getXObject(xObjectName);
if (xObject instanceof PDImageXObject)
{
((PDImageXObject) xObject).getImage();
g++;
}
}
}
doc.close();
if(g==gg) // pdf pages if equal to the images
{
return "Scanned pdf";
}
else
{
return "Searchable pdf";
}
}
Upvotes: 2