slt

Reputation: 99

Apache Tika not returning text for embedded images in Microsoft Word documents (.doc, .docx)

I am using Apache Tika to extract text from Microsoft Word files (.doc and .docx). Below is my code:

public String extractTextFromWord(MultipartFile wordFile) throws IOException, TikaException, SAXException {
    if (wordFile.isEmpty()) {
        return null;
    }
    // media type
    Tika tika = new Tika();
    String mediaType = tika.detect(wordFile.getInputStream());
    if (Set.of(
            "application/x-tika-msoffice",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    ).contains(mediaType)) {
        // extract text
        Parser parser = new AutoDetectParser();
        ContentHandler contentHandler = new BodyContentHandler(-1);
        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, new TesseractOCRConfig());
        parseContext.set(Parser.class, parser);

        parser.parse(wordFile.getInputStream(), contentHandler, new Metadata(), parseContext);
        return contentHandler.toString();
    }
    throw new RuntimeException("File is not a valid Microsoft Word document!");
}

My code works fine for the text in the document body, but not for text inside embedded images. It returns something like "_2147483647.unknown" for the images in the .doc document, and "image1.jpeg, image2.jpg, image3.jpg" for the images in the .docx document (both files have the same content, including 3 images that contain text). I already have Tesseract OCR installed. What should I do next?

Upvotes: 0

Views: 163

Answers (1)

marek.kapowicki

Reputation: 732

For a PDF file, you need to configure the parser to extract (or skip) the embedded images:

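PDFParserConfig pdfParserConfig = new PDFParserConfig();
pdfParserConfig.setExtractInlineImages(true);
pdfParserConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);
// register the config on the ParseContext that is passed to parser.parse(...)
parseContext.set(PDFParserConfig.class, pdfParserConfig);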

This lets you define how text is extracted from a given PDF document (plain extraction without OCR, or OCR).

IMO, for a doc file Tika just does the simple extraction and ignores the parts that would need OCR.

I'm not sure whether there is any way to do a full extraction from a given doc file. You can try to extract the images yourself; see "Get text from image embedded in a .docx file using Tika".
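As a rough illustration of "extract the images yourself", here is a minimal sketch that uses only the JDK and no Tika API. It relies on the fact that a .docx file is a ZIP archive in which Word stores embedded pictures under word/media/ (which matches the image1.jpeg style names in the question). The class name DocxImageExtractor and the method extractImages are hypothetical names chosen for this example. It does not apply to the binary .doc format, which is not a ZIP archive.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper (not part of Tika): pulls embedded images out of a .docx. */
final class DocxImageExtractor {

    // In the OOXML package layout, Word keeps embedded pictures in this folder.
    private static final String MEDIA_PREFIX = "word/media/";

    /** Returns the raw image bytes keyed by ZIP entry name, in archive order. */
    static Map<String, byte[]> extractImages(InputStream docx) throws IOException {
        Map<String, byte[]> images = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            byte[] buffer = new byte[8192];
            while ((entry = zip.getNextEntry()) != null) {
                // Skip directories and everything outside word/media/.
                if (entry.isDirectory() || !entry.getName().startsWith(MEDIA_PREFIX)) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int read;
                // read() returns -1 at the end of the current entry, not of the archive.
                while ((read = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
                images.put(entry.getName(), out.toByteArray());
            }
        }
        return images;
    }
}
```

Each returned byte array could then be wrapped in a ByteArrayInputStream and handed to whatever OCR step you use, for example the parser and ParseContext from the question. Whether OCR actually runs there still depends on your Tika and Tesseract setup, so treat this as a starting point rather than a confirmed fix.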

Upvotes: 0
