RE6

Reputation: 2724

How to read DOC and DOCX files in Java

First, you should know that I have looked at many questions and none of them helped me. I want to be able to read doc and docx documents (by "read" I mean the simplest thing: reading TEXT ONLY). I saw some posts about POI and scratchpad, but I couldn't make it work properly, and most of the time Eclipse couldn't even build my project...

Can someone give me a code sample for doc and docx and give me the names (or links) of all the jars I need to use?

Thanks!

Basically this is the code:

try {
    if (getFileExtention(path).equals("docx")) {
        FileInputStream fis = new FileInputStream(path);
        XWPFWordExtractor oleTextExtractor =
            new XWPFWordExtractor(new XWPFDocument(fis));
        return oleTextExtractor.getText();
    } else if (getFileExtention(path).equals("doc")) {
        FileInputStream fis = new FileInputStream(path);
        WordExtractor we = new WordExtractor(fis);
        return we.getText();
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}


return "";
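
The `getFileExtention` helper called in the snippet is not shown. A minimal sketch of what such a helper might look like follows, assuming it should return the lower-cased text after the last dot of the file name. The class name `FileExtensions` and the spelling `getFileExtension` are hypothetical and not taken from the original project.

```java
import java.util.Locale;

public class FileExtensions {

    /** Returns the lower-cased extension of the file name, or "" if there is none. */
    public static String getFileExtension(String path) {
        // Look only at the file name so that dots in directory names are ignored.
        int slash = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        String name = path.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        // No dot, a leading dot (e.g. ".profile"), or a trailing dot: no extension.
        if (dot <= 0 || dot == name.length() - 1) {
            return "";
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(getFileExtension("/sdcard/docs/Report.DOCX"));
        System.out.println(getFileExtension("/sdcard/my.folder/notes"));
    }
}
```

Lower-casing means that `Report.DOCX` and `report.docx` take the same branch in the `equals("docx")` check.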

I have the following jars:

dom4j-1.6.1.jar

poi-3.8-20120326.jar

poi-ooxml-3.8-20120326.jar

poi-scratchpad-3.8-20120326.jar

xmlbeans-xmlpublic-2.4.0.jar

I have the following problems:

This one occurs many times during the build:

> [2012-07-05 14:12:53 - iCards] Dx warning: Ignoring InnerClasses
> attribute for an anonymous inner class
> (org.dom4j.xpath.DefaultXPath$1) that doesn't come with an associated
> EnclosingMethod attribute. This class was probably produced by a
> compiler that did not target the modern .class file format. The
> recommended solution is to recompile the class from source, using an
> up-to-date compiler and without specifying any "-target" type options.
> The consequence of ignoring this warning is that reflective operations
> on this class will incorrectly indicate that it is *not* an inner
> class.

Another one (when trying to read docx):

> 07-05 14:17:13.245: W/System.err(4339): java.io.IOException: read failed: EBADF (Bad file number)
> 07-05 14:17:13.255: W/System.err(4339):     at libcore.io.IoBridge.read(IoBridge.java:432)
> 07-05 14:17:13.260: W/System.err(4339):     at java.io.FileInputStream.read(FileInputStream.java:179)
> 07-05 14:17:13.265: W/System.err(4339):     at java.io.PushbackInputStream.read(PushbackInputStream.java:196)
> 07-05 14:17:13.270: W/System.err(4339):     at libcore.io.Streams.readFully(Streams.java:81)
> 07-05 14:17:13.275: W/System.err(4339):     at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:230)
> 07-05 14:17:13.280: W/System.err(4339):     at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:51)
> 07-05 14:17:13.285: W/System.err(4339):     at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:83)
> 07-05 14:17:13.290: W/System.err(4339):     at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:228)
> 07-05 14:17:13.295: W/System.err(4339):     at org.apache.poi.util.PackageHelper.open(PackageHelper.java:39)
> 07-05 14:17:13.300: W/System.err(4339):     at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:120)
> 07-05 14:17:13.305: W/System.err(4339):     at com.ronEven.iCards.AddRemove.loadFile(AddRemove.java:504)
> 07-05 14:17:13.310: W/System.err(4339):     at com.ronEven.iCards.AddRemove.showDoc(AddRemove.java:495)
> 07-05 14:17:13.315: W/System.err(4339):     at com.ronEven.iCards.AddRemove.setFilePath(AddRemove.java:492)
> 07-05 14:17:13.320: W/System.err(4339):     at com.ronEven.iCards.FileDialog$1.onClick(FileDialog.java:177)
> 07-05 14:17:13.325: W/System.err(4339):     at android.view.View.performClick(View.java:3591)
> 07-05 14:17:13.330: W/System.err(4339):     at android.view.View$PerformClick.run(View.java:14263)
> 07-05 14:17:13.335: W/System.err(4339):     at android.os.Handler.handleCallback(Handler.java:605)
> 07-05 14:17:13.340: W/System.err(4339):     at android.os.Handler.dispatchMessage(Handler.java:92)
> 07-05 14:17:13.345: W/System.err(4339):     at android.os.Looper.loop(Looper.java:137)
> 07-05 14:17:13.345: W/System.err(4339):     at android.app.ActivityThread.main(ActivityThread.java:4507)
> 07-05 14:17:13.345: W/System.err(4339):     at java.lang.reflect.Method.invokeNative(Native Method)
> 07-05 14:17:13.350: W/System.err(4339):     at java.lang.reflect.Method.invoke(Method.java:511)
> 07-05 14:17:13.350: W/System.err(4339):     at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:790)
> 07-05 14:17:13.350: W/System.err(4339):     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:557)
> 07-05 14:17:13.350: W/System.err(4339):     at dalvik.system.NativeStart.main(Native Method)
> 07-05 14:17:13.355: W/System.err(4339): Caused by: libcore.io.ErrnoException: read failed: EBADF (Bad file number)
> 07-05 14:17:13.360: W/System.err(4339):     at libcore.io.Posix.readBytes(Native Method)
> 07-05 14:17:13.360: W/System.err(4339):     at libcore.io.Posix.read(Posix.java:118)
> 07-05 14:17:13.360: W/System.err(4339):     at libcore.io.BlockGuardOs.read(BlockGuardOs.java:149)
> 07-05 14:17:13.360: W/System.err(4339):     at libcore.io.IoBridge.read(IoBridge.java:422)
> 07-05 14:17:13.365: W/System.err(4339):     ... 24 more

And the last one, when trying to read doc:

07-05 14:17:37.015: W/System.err(4339): java.io.IOException: read failed: EBADF (Bad file number)
07-05 14:17:37.020: W/System.err(4339):     at libcore.io.IoBridge.read(IoBridge.java:432)
07-05 14:17:37.025: W/System.err(4339):     at java.io.FileInputStream.read(FileInputStream.java:179)
07-05 14:17:37.055: W/System.err(4339):     at java.io.PushbackInputStream.read(PushbackInputStream.java:196)
07-05 14:17:37.055: W/System.err(4339):     at java.io.InputStream.read(InputStream.java:163)
07-05 14:17:37.060: W/System.err(4339):     at org.apache.poi.hwpf.HWPFDocumentCore.verifyAndBuildPOIFS(HWPFDocumentCore.java:95)
07-05 14:17:37.065: W/System.err(4339):     at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:53)
07-05 14:17:37.070: W/System.err(4339):     at com.ronEven.iCards.AddRemove.loadFile(AddRemove.java:509)
07-05 14:17:37.075: W/System.err(4339):     at com.ronEven.iCards.AddRemove.showDoc(AddRemove.java:495)
07-05 14:17:37.085: W/System.err(4339):     at com.ronEven.iCards.AddRemove.setFilePath(AddRemove.java:492)
07-05 14:17:37.090: W/System.err(4339):     at com.ronEven.iCards.FileDialog$1.onClick(FileDialog.java:177)
07-05 14:17:37.095: W/System.err(4339):     at android.view.View.performClick(View.java:3591)
07-05 14:17:37.100: W/System.err(4339):     at android.view.View$PerformClick.run(View.java:14263)
07-05 14:17:37.105: W/System.err(4339):     at android.os.Handler.handleCallback(Handler.java:605)
07-05 14:17:37.110: W/System.err(4339):     at android.os.Handler.dispatchMessage(Handler.java:92)
07-05 14:17:37.115: W/System.err(4339):     at android.os.Looper.loop(Looper.java:137)
07-05 14:17:37.120: W/System.err(4339):     at android.app.ActivityThread.main(ActivityThread.java:4507)
07-05 14:17:37.120: W/System.err(4339):     at java.lang.reflect.Method.invokeNative(Native Method)
07-05 14:17:37.125: W/System.err(4339):     at java.lang.reflect.Method.invoke(Method.java:511)
07-05 14:17:37.125: W/System.err(4339):     at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:790)
07-05 14:17:37.130: W/System.err(4339):     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:557)
07-05 14:17:37.130: W/System.err(4339):     at dalvik.system.NativeStart.main(Native Method)
07-05 14:17:37.130: W/System.err(4339): Caused by: libcore.io.ErrnoException: read failed: EBADF (Bad file number)
07-05 14:17:37.150: W/System.err(4339):     at libcore.io.Posix.readBytes(Native Method)
07-05 14:17:37.160: W/System.err(4339):     at libcore.io.Posix.read(Posix.java:118)
07-05 14:17:37.160: W/System.err(4339):     at libcore.io.BlockGuardOs.read(BlockGuardOs.java:149)
07-05 14:17:37.160: W/System.err(4339):     at libcore.io.IoBridge.read(IoBridge.java:422)
07-05 14:17:37.165: W/System.err(4339):     ... 20 more

Upvotes: 2

Views: 3988

Answers (3)

sattik

Reputation: 81

  • For reading DOCX documents we can use XWPFWordExtractor with XWPFDocument.
  • For reading DOC documents we can use WordExtractor with HWPFDocument.

You got the code for DOCX documents right:

XWPFWordExtractor oleTextExtractor = new XWPFWordExtractor(new XWPFDocument(fis));

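Your DOC code also compiles as written, because WordExtractor has a constructor that takes an InputStream (your stack trace shows it being called). If you prefer to create the HWPFDocument explicitly, change this line: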

WordExtractor we = new WordExtractor(fis);

into this:

WordExtractor we = new WordExtractor(new HWPFDocument(fis));

As for the JAR files, only poi-ooxml-schemas-3.8-20120326.jar seems to be missing from your build path.
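
A DOCX file is a ZIP archive, which is why the question's stack trace shows POI going through `ZipInputStream`. While the POI jars are still causing build trouble, the text can in principle also be pulled out with JDK classes alone. The following is a rough sketch of that idea, not POI's method and not a replacement for `XWPFWordExtractor`. It assumes the main text lives in the `word/document.xml` entry inside `w:t` elements. The class name `DocxPlainText` and the method `extractText` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DocxPlainText {

    // Namespace of WordprocessingML elements such as w:p (paragraph) and w:t (text).
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of word/document.xml, one line per paragraph, or "" if absent. */
    public static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Copy the entry into memory so the XML parser cannot close the ZIP stream.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document xml = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(buf.toByteArray()));

                // Concatenate the w:t runs of each w:p; tabs and line breaks are ignored,
                // and text in nested paragraphs (e.g. text boxes) may appear twice.
                StringBuilder out = new StringBuilder();
                NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    NodeList runs = ((Element) paragraphs.item(i))
                        .getElementsByTagNameNS(W_NS, "t");
                    for (int j = 0; j < runs.getLength(); j++) {
                        out.append(runs.item(j).getTextContent());
                    }
                    out.append('\n');
                }
                return out.toString();
            }
        }
        return "";
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DocxPlainText path/to/file.docx
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(extractText(in));
        }
    }
}
```

This approach does not work for the older binary DOC format, which is not ZIP-based. For that format, the HWPF classes from poi-scratchpad are still needed.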

Upvotes: 0

jspboix

Reputation: 776

Apache Tika supports the Microsoft Office formats as well as many other formats. It gives you a common interface for all of them and hides the complexity of learning and maintaining many different libraries. It is as easy as calling a single function. You could also use the OfficeParser and OOXMLParser directly.

Upvotes: 3

cl-r

Reputation: 1264

There are also very powerful tools such as the LibreOffice SDK (or OpenOffice 3), which can read and manipulate documents such as DOCX and save them in .txt format.

Upvotes: 0
