Reputation: 99
I am using Apache Tika to extract text from Microsoft Word files (.doc and .docx). My code is below:
public String extractTextFromWord(MultipartFile wordFile) throws IOException, TikaException, SAXException {
    if (wordFile.isEmpty()) {
        return null;
    }
    // media type
    Tika tika = new Tika();
    String mediaType = tika.detect(wordFile.getInputStream());
    if (Set.of(
            "application/x-tika-msoffice",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    ).contains(mediaType)) {
        // extract text
        Parser parser = new AutoDetectParser();
        ContentHandler contentHandler = new BodyContentHandler(-1);
        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, new TesseractOCRConfig());
        parseContext.set(Parser.class, parser);
        parser.parse(wordFile.getInputStream(), contentHandler, new Metadata(), parseContext);
        return contentHandler.toString();
    }
    throw new RuntimeException("File is not a valid Microsoft Word document!");
}
My code works fine for the text in the document, but not for text in embedded images. It returns something like "_2147483647.unknown" for images in the .doc document, and "image1.jpeg, image2.jpg, image3.jpg" for images in the .docx document (both files have the same content, including 3 images that contain text). I already have Tesseract OCR installed. What should I do next?
Upvotes: 0
Views: 163
Reputation: 732
For a PDF file, you need to configure the parser to either extract or skip the embedded images:
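PDFParserConfig pdfParserConfig = new PDFParserConfig();
pdfParserConfig.setExtractInlineImages(true);
pdfParserConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);
// register the config so the PDF parser picks it up
parseContext.set(PDFParserConfig.class, pdfParserConfig);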
This lets you choose how text is extracted from a given PDF: plain extraction without OCR, or OCR.
In my opinion, for doc files Tika only does the simple text extraction and ignores the parts that would need OCR.
I am not sure whether there is any way to do a full extraction from a doc file. You can try extracting the images yourself; see the question "Get text from image embedded in a .docx file using Tika".
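To illustrate the "extract the images yourself" idea for the .docx case: a .docx file is a ZIP package, and its embedded pictures normally live under word/media/ (which matches the image1.jpeg, image2.jpg names in the question). Below is a minimal sketch that copies those entries to a directory using only the JDK, so each image can then be handed to an OCR step separately. The class name DocxImageExtractor and the method extractImages are made up for this example, and it does not apply to the older binary .doc format, which is not a ZIP file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika: pulls embedded images out of a .docx package.
public class DocxImageExtractor {

    /**
     * Copies every file stored under word/media/ in the given .docx stream
     * into outputDir and returns the paths of the files that were written.
     */
    public static List<Path> extractImages(InputStream docx, Path outputDir) throws IOException {
        List<Path> written = new ArrayList<>();
        Files.createDirectories(outputDir);
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith("word/media/")) {
                    continue; // not an embedded media file
                }
                // Keep only the last path segment so an entry name cannot point outside outputDir.
                String fileName = name.substring(name.lastIndexOf('/') + 1);
                if (fileName.isEmpty() || fileName.equals(".") || fileName.equals("..")) {
                    continue;
                }
                Path target = outputDir.resolve(fileName);
                // Files.copy reads until the end of the current ZIP entry and leaves the stream open.
                Files.copy(zip, target, StandardCopyOption.REPLACE_EXISTING);
                written.add(target);
            }
        }
        return written;
    }
}
```

With the code from the question, this could be called as DocxImageExtractor.extractImages(wordFile.getInputStream(), someTempDir), and each returned file could then be parsed on its own (for example with the same AutoDetectParser, or with the tesseract command line directly) to see whether OCR works on the images in isolation.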
Upvotes: 0