Rijo Joseph

Reputation: 1405

Java reading .doc file using POI

Hi, I am trying to read text from .doc and .docx files. For .doc files I am doing this:

package test;

import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadFile {
    public static void main(String[] args) {
        File file = new File("C:\\Users\\rijo\\Downloads\\r.doc");
        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream fis = new FileInputStream(file)) {
            HWPFDocument document = new HWPFDocument(fis);
            WordExtractor extractor = new WordExtractor(document);
            String fileData = extractor.getText();
            System.out.println(fileData);
        } catch (Exception exep) {
            // print the failure instead of silently swallowing it
            exep.printStackTrace();
        }
    }
}

But this gives me an org/apache/poi/OldFileFormatException exception.

Any idea how to fix this?

Also, I need to read .docx and PDF files. Is there a good way to read all of these file types?

Upvotes: 3

Views: 15324

Answers (3)

Darius Miliauskas

Reputation: 3514

I do not know why you are using WordExtractor just to get the text from a .doc file. For me, a single method was enough:

import org.apache.poi.hwpf.HWPFDocument;
...
File fin = new File(yourFilePath);
FileInputStream fis = new FileInputStream(fin);
HWPFDocument doc = new HWPFDocument(fis);
String text = doc.getDocumentText();
System.out.println(text);
...

To work with .pdf files, use another Apache library: PDFBox.

Upvotes: 0

Levenal

Reputation: 3806

Using the following jars (in case version numbers play a role here):

dom4j-1.7-20060614
poi-3.9-20121203
poi-ooxml-3.9-20121203
poi-ooxml-schemas-3.9-20121203
poi-scratchpad-3.9-20121203
xmlbeans-2.4.0

I typed this up:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class SO {
    public static void main(String[] args) {
        // Alternate between the two to check what works.
        //String filePath = "D:\\Users\\username\\Desktop\\Doc1.docx";
        String filePath = "D:\\Users\\username\\Desktop\\Bob.doc";

        // try-with-resources closes the stream in both branches
        try (FileInputStream fis = new FileInputStream(new File(filePath))) {
            if (filePath.endsWith("x")) { // is a docx
                XWPFDocument doc = new XWPFDocument(fis);
                XWPFWordExtractor extract = new XWPFWordExtractor(doc);
                System.out.println(extract.getText());
            } else { // is not a docx
                HWPFDocument doc = new HWPFDocument(fis);
                WordExtractor extractor = new WordExtractor(doc);
                System.out.println(extractor.getText());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This allowed me to read text from both a .docx and a .doc file. If it doesn't work on your PC, you may well have an issue with the external jars you are using.
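
If you'd rather not rely on the file extension, one option is to look at the first few bytes of the file before choosing a reader. The sketch below uses only the JDK. The class name FormatSniffer and its method names are made up for illustration, and it assumes the usual signatures: the OLE2 header for binary .doc, the ZIP header for .docx, and %PDF for PDF. OLE2 and ZIP are containers, so a match only says the file is probably a .doc or .docx (an .xls or a plain .zip would match too).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of POI: guesses a file type from its leading bytes.
class FormatSniffer {
    // OLE2 compound-file signature (container used by binary .doc files)
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local-file-header signature (a .docx is a ZIP package)
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    // PDF files begin with "%PDF"
    private static final byte[] PDF = {'%', 'P', 'D', 'F'};

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns "doc", "docx", "pdf" or "unknown"; the first two are only best guesses.
    static String detect(byte[] header) {
        if (startsWith(header, OLE2)) {
            return "doc";
        }
        if (startsWith(header, ZIP)) {
            return "docx";
        }
        if (startsWith(header, PDF)) {
            return "pdf";
        }
        return "unknown";
    }

    // Reads up to 8 bytes from the file and classifies them.
    static String detectFile(Path path) throws IOException {
        byte[] header = new byte[8];
        int total = 0;
        try (InputStream in = Files.newInputStream(path)) {
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) {
                    break; // file is shorter than 8 bytes
                }
                total += n;
            }
        }
        return detect(Arrays.copyOf(header, total));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java FormatSniffer <file>");
            return;
        }
        System.out.println(detectFile(Paths.get(args[0])));
    }
}
```

The returned label could then pick between HWPFDocument, XWPFDocument and a PDF library, in place of the last-character check on the path.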

Give it a go though :) Good luck!

Upvotes: 7

Rahul

Reputation: 45060

If you look at the Javadoc for OldFileFormatException, you can see the reason:

Base class of all the exceptions that POI throws in the event that it's given a file that's older than currently supported.

This means that the r.doc you're using is in a format older than HWPFDocument supports. HWPFDocument handles the binary .doc format written by Word 97 and later (.docx is handled separately by XWPFDocument), so the file was probably created by an older version of Word.

Upvotes: 1
