Reputation: 1

extract image from image

Is it possible to extract an image from a jpeg, png or tiff file? NOT PDF! Suppose I have a file containing both text and images in jpeg format (so it's basically a picture); I want to be able to extract the image only programmatically (preferably using Java). If anyone knows useful libraries please let me know. I have already tried AspriseOCR and tesseract-ocr, they have been successful at extracting text only (obviously). Thank you.

Upvotes: 0

Answers (2)

Ilya Evdokimov

Reputation: 1394

If you are interested in an out-of-the-box product that can do this as a black box, with minimal configuration and no programming (since you have already tried other products), then ABBYY FlexiCapture can do it. It can be configured to look for pictures or objects of varying size in loosely defined areas, or anywhere on the page, with full control over the search logic. I used it once to extract lines of a specific shape and thickness that separated the chapters of a book; each line marked a new chapter and could be anywhere on the page.

Upvotes: 0

Zaw Than oo

Reputation: 9935

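Try:

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

int startPointX = xxx; // left edge of the region
int startPointY = xxx; // top edge of the region
int endPointX = xxx;   // right edge of the region
int endPointY = xxx;   // bottom edge of the region
BufferedImage image = ImageIO.read(new File("D:/temp/test.jpg"));
// getSubimage takes x, y, width, height, so convert the end point into a size
BufferedImage out = image.getSubimage(startPointX, startPointY, endPointX - startPointX, endPointY - startPointY);
ImageIO.write(out, "jpg", new File("D:/temp/result.jpg"));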

The start and end points are the top-left and bottom-right corners of the region of the image you want to extract.

Extract images from a PDF file

I suggest changing your post title. You can use the PDFBox or iText API. The example below extracts all of the images from a PDF file and may be a useful resource for you. If the PDF contains a lot of images, a java.lang.OutOfMemoryError may occur.
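
As a hedged sketch of the same idea, the helper below converts two corner points into the x, y, width, height arguments that BufferedImage.getSubimage expects and clamps them to the image bounds. RegionCropper, crop and cropFile are hypothetical names, not part of the answer above. It assumes you already know where the picture sits inside the scanned page; it does not detect the region for you. Writing the result as PNG is also an assumption, chosen to avoid a second round of JPEG compression.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

/** Hypothetical helper for illustration only. */
final class RegionCropper {

    /** Crops the rectangle spanned by two corner points (end point exclusive). */
    static BufferedImage crop(BufferedImage src, int startX, int startY, int endX, int endY) {
        // Order the corners and clamp them to the image so getSubimage
        // cannot fail with a RasterFormatException.
        int x0 = Math.max(0, Math.min(startX, endX));
        int y0 = Math.max(0, Math.min(startY, endY));
        int x1 = Math.min(src.getWidth(), Math.max(startX, endX));
        int y1 = Math.min(src.getHeight(), Math.max(startY, endY));
        if (x1 <= x0 || y1 <= y0) {
            throw new IllegalArgumentException("Region is empty or outside the image");
        }
        // getSubimage expects x, y, width, height, not a second corner.
        return src.getSubimage(x0, y0, x1 - x0, y1 - y0);
    }

    /** Reads an image file, crops it, and writes the region as PNG. */
    static void cropFile(File in, File out, int startX, int startY, int endX, int endY)
            throws IOException {
        BufferedImage src = ImageIO.read(in);
        if (src == null) {
            // ImageIO.read returns null when no installed reader understands the file.
            throw new IOException("Unsupported or unreadable image: " + in);
        }
        BufferedImage region = crop(src, startX, startY, endX, endY);
        if (!ImageIO.write(region, "png", out)) {
            throw new IOException("No PNG writer available for: " + out);
        }
    }
}
```

For example, RegionCropper.cropFile(new File("D:/temp/test.jpg"), new File("D:/temp/result.png"), 100, 150, 400, 350) would save a 300 x 200 pixel region, provided the source image is at least 400 x 350 pixels.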

Download pdfbox.xx.jar here.

import java.io.File;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

// Uses the PDFBox 1.x API (getAllPages, getImages and PDXObjectImage were removed in 2.0).
public class ExtractImagesFromPDF {
    public static void main(String[] args) throws Exception {
        PDDocument document = PDDocument.load(new File("D:/temp/test.pdf"));
        try {
            List pages = document.getDocumentCatalog().getAllPages();
            Iterator iter = pages.iterator();
            int pageNumber = 0;
            while (iter.hasNext()) {
                PDPage page = (PDPage) iter.next();
                PDResources resources = page.getResources();
                Map images = resources.getImages();
                if (images != null) {
                    Iterator imageIter = images.keySet().iterator();
                    while (imageIter.hasNext()) {
                        String key = (String) imageIter.next();
                        System.out.println("Key : " + key);
                        PDXObjectImage image = (PDXObjectImage) images.get(key);
                        // Keys can repeat from page to page, so include the page number in the file name
                        File file = new File("D:/temp/page" + pageNumber + "_" + key + "." + image.getSuffix());
                        image.write2file(file);
                    }
                }
                pageNumber++;
            }
        } finally {
            document.close(); // release the file handle and memory
        }
    }
}

Extract a specific image from a PDF file

To extract a specific image, you have to know the index of the page and the index of the image on that page. Otherwise, you cannot extract it.

The following example extracts the first image of the first page.

 int targetPage = 0;
 PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(targetPage);
 PDResources resources = firstPage.getResources();
 Map images = resources.getImages();
 int targetImage = 0;
 // Assumes the image is registered under the key "Im0"; resource names vary between PDFs
 String imageKey = "Im" + targetImage;
 PDXObjectImage image = (PDXObjectImage) images.get(imageKey);
 if (image != null) { // null means no image is stored under that key
     File file = new File("D:/temp/" + imageKey + "." + image.getSuffix());
     image.write2file(file);
 }

Upvotes: 1
