NedStarkOfWinterfell

Reputation: 5203

How to read old word doc file metadata

Suppose I want to import a Word file with the .doc extension into my HTML document, along with its metadata, and display it in a div. That means reproducing everything in the doc file: text in varied formats (bold, italics, different sizes, letter spacing, line height, overline, underline, ...), images (both their positions and sizes), graphs and charts (the JSP will generate the necessary graphics to provide a similar-looking graph or chart; it needs only the data), lists, etc.

So is there any way to do this? Is there any standardized Word API which will give us this data? Or any JSP library that can do it? If not, then what do I need to know and do to get this?

Upvotes: 1

Views: 896

Answers (2)

Christophe Roussy

Reputation: 17049

And 5 years later, the answer:

NOTE: this code works for old Word 'doc' files only (not docx). Apache POI can also handle docx, but you must use a different API.

Using Apache POI, add the Maven dependency:

<!-- https://mvnrepository.com/artifact/org.apache.poi/poi -->
<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi</artifactId>
  <version>3.17</version>
</dependency>

And here is the code:

  ...
  import java.io.FileInputStream;
  import java.io.FileNotFoundException;
  import java.io.IOException;

  import org.apache.poi.hpsf.MarkUnsupportedException;
  import org.apache.poi.hpsf.NoPropertySetStreamException;
  import org.apache.poi.hpsf.PropertySet;
  import org.apache.poi.hpsf.SummaryInformation;
  import org.apache.poi.hpsf.UnexpectedPropertySetTypeException;
  import org.apache.poi.poifs.filesystem.DirectoryEntry;
  import org.apache.poi.poifs.filesystem.DocumentEntry;
  import org.apache.poi.poifs.filesystem.DocumentInputStream;
  import org.apache.poi.poifs.filesystem.POIFSFileSystem;

  public static void main(final String[] args) throws FileNotFoundException, IOException, NoPropertySetStreamException,
                  MarkUnsupportedException, UnexpectedPropertySetTypeException {
    // Open the .doc file as an OLE2 container.
    try (final FileInputStream fs = new FileInputStream("src/test/word_template.doc");
        final POIFSFileSystem poifs = new POIFSFileSystem(fs)) {
      final DirectoryEntry dir = poifs.getRoot();
      // The summary information stream holds author, subject, keywords, etc.
      final DocumentEntry siEntry = (DocumentEntry) dir.getEntry(SummaryInformation.DEFAULT_STREAM_NAME);
      try (final DocumentInputStream dis = new DocumentInputStream(siEntry)) {
        final PropertySet ps = new PropertySet(dis);
        final SummaryInformation si = new SummaryInformation(ps);
        // Read word doc (not docx) metadata.
        System.out.println(si.getLastAuthor());
        System.out.println(si.getAuthor());
        System.out.println(si.getKeywords());
        System.out.println(si.getSubject());
        // ...
      }
    }
  }
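
Since the question asks for the metadata to be shown in a div, the values printed above could instead be collected into a map and rendered as an HTML fragment. The following is only a minimal sketch using the standard library: the class name MetadataHtml, the CSS class doc-metadata, and the sample values are hypothetical and not part of Apache POI. The important point is to escape the values, because document properties are arbitrary user text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataHtml {

    // Escape the characters that are significant in HTML text and attributes.
    static String escape(String s) {
        if (s == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Render name/value pairs as a definition list wrapped in a div.
    static String toDiv(Map<String, String> metadata) {
        StringBuilder html = new StringBuilder("<div class=\"doc-metadata\"><dl>");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            html.append("<dt>").append(escape(e.getKey())).append("</dt>")
                .append("<dd>").append(escape(e.getValue())).append("</dd>");
        }
        return html.append("</dl></div>").toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample values; in practice these would come from
        // si.getAuthor(), si.getSubject(), and so on.
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("Author", "Jane <Doe>");
        meta.put("Subject", "Q&A");
        System.out.println(toDiv(meta));
    }
}
```

A LinkedHashMap is used so the properties appear in the order they were added.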

To read the text content you will need an additional dependency:

<dependency>
  <!-- Required for HWPFDocument -->
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-scratchpad</artifactId>
  <version>3.17</version>
</dependency>

Code:

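import org.apache.poi.hwpf.HWPFDocument;

// fs must be a freshly opened InputStream on the .doc file
// (not one already consumed by a POIFSFileSystem).
try (final HWPFDocument doc = new HWPFDocument(fs)) {
  return doc.getText().toString();
}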
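
Because the code above only handles legacy doc files, it can help to check which container format a file really uses before picking an API, since the extension may be wrong. Here is a rough sketch that looks at the leading bytes: legacy Office files start with the OLE2 compound file signature, while docx files are ZIP packages. The class name WordFormatSniffer and the returned labels are made up for illustration, and a match only identifies the container (OLE2 is also used by xls and ppt, and any ZIP archive matches the second check).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class WordFormatSniffer {

    // OLE2 compound file signature, used by legacy .doc (and .xls, .ppt) files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature; .docx files are ZIP packages.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Returns "OLE2", "ZIP", or "UNKNOWN" based on the leading bytes.
    static String detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: WordFormatSniffer <file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            byte[] header = new byte[8];
            int n = in.read(header);
            // Trim to the bytes actually read (n is -1 for an empty file).
            System.out.println(detect(Arrays.copyOf(header, Math.max(n, 0))));
        }
    }
}
```

If the result is OLE2, the HPSF/HWPF code shown above applies; if it is ZIP, the file is probably a docx and needs POI's other API.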

Upvotes: 0

g051051

Reputation: 1041

Check out the Apache POI project: http://poi.apache.org/text-extraction.html as well as Apache Tika: http://tika.apache.org/

Upvotes: 1
