Apache Tika not returning text from images embedded in Microsoft Word documents (.doc, .docx)

Question

I am using Apache Tika to extract text from Microsoft Word files (.doc and .docx). My code is below:

public String extractTextFromWord(MultipartFile wordFile) throws IOException, TikaException, SAXException {
    if (wordFile.isEmpty()) {
        return null;
    }
    // Detect the media type from the file content
    Tika tika = new Tika();
    String mediaType = tika.detect(wordFile.getInputStream());
    if (Set.of(
            "application/x-tika-msoffice",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    ).contains(mediaType)) {
        // extract text
        Parser parser = new AutoDetectParser();
        ContentHandler contentHandler = new BodyContentHandler(-1);
        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, new TesseractOCRConfig());
        parseContext.set(Parser.class, parser);

        parser.parse(wordFile.getInputStream(), contentHandler, new Metadata(), parseContext);
        return contentHandler.toString();
    }
    throw new RuntimeException("File is not a valid Microsoft Word document!");
}

The code works for the document's body text, but not for text inside embedded images. For the images it returns only names: something like "_2147483647.unknown" for the .doc file, and "image1.jpeg, image2.jpg, image3.jpg" for the .docx file. Both files have the same content, including three images that contain text. Tesseract OCR is already installed. What should I do next?
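
Since Tika's OCR parser shells out to the tesseract executable, one thing that may be worth ruling out is whether the JVM process itself can launch that binary (being installed is not the same as being on the PATH the application sees). The snippet below is a minimal diagnostic sketch, not part of Tika: the class name TesseractCheck and the method isRunnable are hypothetical, and it assumes the executable is named "tesseract" and accepts "--version".

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not a Tika API: checks whether a command can be launched by this JVM.
class TesseractCheck {

    /** Returns true if "<command> --version" starts and exits with code 0 within 10 seconds. */
    public static boolean isRunnable(String command) {
        try {
            Process process = new ProcessBuilder(command, "--version")
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // output is not needed
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();                           // hung: treat as unavailable
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            return false;                                            // not found or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();                      // preserve interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the binary is called "tesseract" on this system.
        System.out.println("tesseract runnable: " + isRunnable("tesseract"));
    }
}
```

If this prints false when run in the same environment as the application (same user, same service or container), the OCR step is probably being skipped because the binary cannot be found, which would be consistent with getting only image names back.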
