Reputation: 133

GET TEXT FROM IMAGE EMBEDDED IN A .docx FILE USING TIKA

I've been working on a text extractor for .docx files using Tika. It works fine for basic text and for text in tables and textboxes, but it fails for images.

How do I get the text from an embedded image? Tesseract together with Tika can extract text from a standalone image, but for that I would first need to extract the image from the document. How do I do this?

Kindly help if anybody has worked on something like this.

This is the code that works fine for text, textboxes and tables, but not for images:

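import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class BasicDocumentExtractor {
    public static void main(final String[] args) throws IOException, SAXException, TikaException {

        // Collects the body text of the parsed document
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();

        // OOXML parser; try-with-resources closes the input stream
        OOXMLParser msofficeparser = new OOXMLParser();
        try (InputStream inputstream = new FileInputStream(new File("D:\\Nidhi\\sw\\ws\\Hello.docx"))) {
            msofficeparser.parse(inputstream, handler, metadata, pcontext);
        }
        System.out.println("Contents of the document:" + handler.toString());

        /*System.out.println("Metadata of the document:");
        String[] metadataNames = metadata.names();

        for(String name : metadataNames){
            System.out.println(name + ": " + metadata.get(name));
        }*/
    }
}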

Upvotes: 0

Answers (2)

Gagravarr

Reputation: 48346

You need to enable recursion in Tika in order to get the embedded images. The simplest way is normally just to use the RecursiveParserWrapper to do it for you.

If you use it, your code would instead be roughly

    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    TikaInputStream input = TikaInputStream.get(new File("D:\\Nidhi\\sw\\ws\\Hello.docx"));

    Parser wrapped = new AutoDetectParser();
    RecursiveParserWrapper wrapper = new RecursiveParserWrapper(wrapped,
            new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, 60));

    wrapper.parse(input, handler, metadata, context);

    // Get the metadata collected for the document and its embedded children
    List<Metadata> list = wrapper.getMetadata();
    // Get metadata from main document
    System.out.println("Main doc name is " + metadata.get(TikaCoreProperties.TITLE));

    System.out.println("Contents of the document:" + handler.toString());

Upvotes: 1

Nidhi jain

Reputation: 133

After trying hard at this for the last 24 hours, I figured out a way, and a pretty easy one. Since Tika uses POI under the hood for Office documents, the task can be done efficiently with POI directly. It is not a direct solution, so almost no tutorials are available for it; I hope nobody else has to face this issue in future. This is the running code that extracts all images from a .docx document:

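import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;

public static void getImages() throws Exception {

    // try-with-resources closes the input stream, the document and each output stream
    try (FileInputStream in = new FileInputStream("D:\\Nidhi\\CDAC\\Images\\test1.docx");
         XWPFDocument doc = new XWPFDocument(in)) {

        List<XWPFPictureData> images = doc.getAllPictures();

        for (int i = 0; i < images.size(); i++) {
            XWPFPictureData pic = images.get(i);
            System.out.println(pic.getFileName() + "   " + pic.getPictureType() + "  " + pic.getData().length + " bytes");

            // Keep the picture's own extension instead of assuming .jpg
            try (FileOutputStream fos = new FileOutputStream("D:\\Nidhi\\CDAC\\Images\\b" + i + "." + pic.suggestFileExtension())) {
                fos.write(pic.getData());
            }
        }
    }
}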

This works on Word 2007+ (.docx) files; for .doc and similar older files, use HWPF in much the same way.
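
As a dependency-free alternative to the POI approach above, here is a minimal sketch that relies only on the fact that a .docx file is a ZIP archive in which Word stores embedded pictures under word/media/. The class name DocxMediaExtractor and the method extractMedia are made up for illustration; they are not part of Tika or POI, and the sketch assumes the pictures are stored inside the package rather than linked externally.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper (not a Tika or POI class): reads pictures straight
// out of the .docx ZIP container using only the Java standard library.
final class DocxMediaExtractor {

    private static final String MEDIA_PREFIX = "word/media/";

    /** Returns the bytes of every entry under word/media/, keyed by plain file name. */
    static Map<String, byte[]> extractMedia(InputStream docx) throws IOException {
        Map<String, byte[]> images = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith(MEDIA_PREFIX)) {
                    continue; // not an embedded picture
                }
                // Copy the current entry's bytes into memory
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    out.write(chunk, 0, n);
                }
                // Keep only the last path segment so the key is a bare file name
                images.put(name.substring(name.lastIndexOf('/') + 1), out.toByteArray());
            }
        }
        return images;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: DocxMediaExtractor <file.docx> <output-dir>");
            return;
        }
        Path outDir = Files.createDirectories(Paths.get(args[1]));
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (Map.Entry<String, byte[]> image : extractMedia(in).entrySet()) {
                Files.write(outDir.resolve(image.getKey()), image.getValue());
            }
        }
    }
}
```

Each file written this way keeps its original name and extension, so it can be passed to the Tesseract-backed image extraction that, as the question notes, already works on standalone images.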

Upvotes: 1
