Karthikeyan

Reputation: 526

Parse content of Word document using dot net or Java


I was hoping someone could help with parsing an MS Word document. Basically, I need to parse the content of a Word document and use the resulting values to build a map. The document will have content like this:

Key1: Value1
Key2: Value2
KeyKey1: Key11: Value11
         Key12: Value12
KeyKey2:
  Key21: Value21
  Key22: Value22

The document will contain either tables or key-value pairs (including key-key-value entries). We need to distinguish a plain key from a keykey (a key that groups nested keys), parse the document, and insert the result into a map. At present I am looking at manual parsing, which seems to require too much hard-coding of values. For instance, how can I differentiate Key1 from KeyKey1, and Key1 from KeyKey2?
Please suggest a method for parsing the content of a Word document, or libraries for doing so in C# or Java.

Any help will be appreciated. Thanks in advance.

Upvotes: 0

Views: 807

Answers (3)

Sap

Reputation: 5291

The best library for this at the moment is Apache Tika. It supports multiple document types and requires only a few lines of code. You can read this article: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika. If you ignore the Solr-related code, it takes only 5-6 lines of code to extract PDF content.

Upvotes: 1

Nirmit Shah

Reputation: 758

You can have a look at the Java API for Microsoft Documents for parsing Word documents in Java.

Upvotes: 0

Nico Huysamen

Reputation: 10417

Do you need to look at the content of the document? For that you can use Apache POI with Java. We use it in our application without any problems, both reading and writing Word and Excel documents. The documentation is very thorough and the API is quite easy to use.
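
Once the text has been extracted from the document (by whatever library you choose), telling a key from a keykey could be sketched roughly as below. This is only a minimal sketch under assumptions taken from the sample in the question: nested entries are indented, and a top-level line opens a group when its value is empty or itself contains another colon. The class name `KeyValueParser` and the method `parse` are hypothetical names made up for this illustration, and the extracted lines are assumed to be already available as a list of strings.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of POI or Tika: groups extracted lines into a map.
final class KeyValueParser {

    /**
     * Returns a map whose values are either a String (plain key)
     * or a nested Map<String, String> (keykey that groups other keys).
     */
    static Map<String, Object> parse(List<String> lines) {
        Map<String, Object> result = new LinkedHashMap<>();
        Map<String, String> currentGroup = null; // group opened by the last keykey

        for (String raw : lines) {
            if (raw.trim().isEmpty()) {
                continue; // skip blank lines
            }
            // Assumption: indentation marks a line as nested under the last keykey.
            boolean indented = Character.isWhitespace(raw.charAt(0));
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a key-value line
            }
            String key = line.substring(0, colon).trim();
            String rest = line.substring(colon + 1).trim();

            if (indented && currentGroup != null) {
                currentGroup.put(key, rest);
                continue;
            }

            int innerColon = rest.indexOf(':');
            if (rest.isEmpty() || innerColon >= 0) {
                // "KeyKey2:" or "KeyKey1: Key11: Value11" opens a new group.
                currentGroup = new LinkedHashMap<>();
                result.put(key, currentGroup);
                if (innerColon >= 0) {
                    currentGroup.put(rest.substring(0, innerColon).trim(),
                                     rest.substring(innerColon + 1).trim());
                }
            } else {
                // "Key1: Value1" is a plain entry and closes any open group.
                currentGroup = null;
                result.put(key, rest);
            }
        }
        return result;
    }
}
```

Fed the seven sample lines from the question, this yields plain string entries for Key1 and Key2 and nested maps for KeyKey1 and KeyKey2. The colon heuristic is fragile: a value that itself contains a colon (a URL or a time, for example) would be mistaken for a keykey, so a real document may need a stricter rule such as paragraph styles or table structure.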

Upvotes: 1
