Reputation: 1
Is it possible to extract an image from a jpeg, png or tiff file? NOT PDF! Suppose I have a file containing both text and images in jpeg format (so it's basically a picture); I want to be able to extract the image only programmatically (preferably using Java). If anyone knows useful libraries please let me know. I have already tried AspriseOCR and tesseract-ocr, they have been successful at extracting text only (obviously). Thank you.
Upvotes: 0
Views: 3771
Reputation: 1394
If you are interested in an out-of-box product that could do this via black-box processing with minimal non-programming configuration (since you tried other products), then ABBYY FlexiCapture can do it. It can be configured to look for dynamic sizes of pictures/objects in loosely defined areas, or anywhere on the page, with full control over search logic. I used it once to extract lines of specific shape and thickness to separate chapters of a book, where each line indicated a new chapter, and could be anywhere on the page.
Upvotes: 0
Reputation: 9935
Try :
int startProintX = xxx;
int startProintY = xxx;
int endProintX = xxx;
int endProintY = xxx;
BufferedImage image = ImageIO.read(new File("D:/temp/test.jpg"));
BufferedImage out = image.getSubimage(startProintX, startProintY, endProintX, endProintY);
ImageIO.write(out, "jpg", new File("D:/temp/result.jpg"));
These point are region of image you want to extract.
Extract image from pdf file
I suggest to change your post tile. You can use pdfbox
or iText
api. The below example to extract the all of the image from pdf file.
There might be some resource for you. If there are a lot of image in pdf, may be occur java.lang.OutOfMemoryError
.
Download pdfbox.xx.jar
here.
import java.io.File;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.PDFBox;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
import org.jdom.Document;
public class ExtractImagesFromPDF {
public static void main(String[] args) throws Exception {
PDDocument document = PDDocument.load(new File("D:/temp/test.pdf"));
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
while(iter.hasNext()) {
PDPage page = (PDPage)iter.next();
PDResources resources = page.getResources();
Map images = resources.getImages();
if( images != null ) {
Iterator imageIter = images.keySet().iterator();
while(imageIter.hasNext()) {
String key = (String)imageIter.next();
System.out.println("Key : " + key);
PDXObjectImage image = (PDXObjectImage)images.get(key);
File file = new File("D:/temp/" + key + "." + image.getSuffix());
image.write2file(file);
}
}
}
}
}
Extract specific image from pdf file
To extract specific image, you have to know index of page
and index of image
of that page. Otherwise, you cannot extract.
The following example program extract first image
of first page
.
int targetPage = 0;
PDPage firstPage = (PDPage)document.getDocumentCatalog().getAllPages().get(targetPage);
PDResources resources = firstPage.getResources();
Map images = resources.getImages();
int targetImage = 0;
String imageKey = "Im" + targetImage;
PDXObjectImage image = (PDXObjectImage)images.get(imageKey);
File file = new File("D:/temp/" + imageKey + "." + image.getSuffix());
image.write2file(file);
Upvotes: 1