Reputation: 526
I was hoping someone could help with parsing an MS Word document. Basically, I need to parse the content of a Word document and use the resulting values to build a map. The document will have content like this:
Key1: Value1
Key2: Value2
KeyKey1: Key11: Value11
Key12: Value12
KeyKey2:
Key21: Value21
Key22: Value22
The document will contain either tables or key-value pairs (including key-key-value). We need to distinguish a key from a keykey, parse the document, and insert the results into a map. At present I am looking at manual parsing, which seems to require too much hard-coding of values. For instance, how do I differentiate Key1 from KeyKey1, and Key1 from KeyKey2?
Please suggest a method or a library, in C# or Java, for parsing the content of a Word document.
Any help will be appreciated. Thanks in advance.
Upvotes: 0
Views: 807
Reputation: 5291
The best library for this right now is Apache Tika. It supports multiple document types and takes only a few lines of code. You can read this article: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika. If you ignore the Solr-related code, it is only 5-6 lines of code to extract PDF content.
Upvotes: 1
Reputation: 758
You can have a look at Apache POI, the Java API for Microsoft Documents, for parsing Word documents in Java.
Upvotes: 0
Reputation: 10417
Do you need to look at the content of the document? For that you can use Apache POI with Java. We use it in our application without any problems, both reading and writing Word and Excel documents. The documentation is very complete and the API is quite easy to use.
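Once the text is out of the document, you can probably tell a key from a keykey by the shape of each line instead of hard-coding names. Below is a minimal sketch of that idea in plain Java. It assumes the paragraphs have already been extracted as strings (with POI, for a .docx, that would be `XWPFDocument.getParagraphs()` and `XWPFParagraph.getText()`). `KeyValueParser` and `parse` are hypothetical names of mine, not part of POI, and the grouping rules are guesses based on the sample in the question: a line like `A:` or `A: B: C` opens a group, and a blank line closes it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any library.
class KeyValueParser {

    /**
     * Turns "Key: Value" lines into a map. A group header ("KeyKey:" or
     * "KeyKey: Key: Value") maps to a nested map holding the lines after it.
     */
    static Map<String, Object> parse(List<String> lines) {
        Map<String, Object> root = new LinkedHashMap<>();
        Map<String, Object> current = root; // where plain "Key: Value" lines go

        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                current = root; // assumption: a blank line ends the open group
                continue;
            }

            // At most three parts: "A: B: C" -> [A, B, C]; "A:" -> [A, ""]
            String[] parts = line.split(":", 3);
            for (int i = 0; i < parts.length; i++) {
                parts[i] = parts[i].trim();
            }

            if (parts.length == 1) {
                continue; // no colon, so not a key-value line; skipped here
            }

            boolean headerOnly = parts.length == 2 && parts[1].isEmpty();
            if (headerOnly || parts.length == 3) {
                // Group header: open a nested map under the keykey
                Map<String, Object> group = new LinkedHashMap<>();
                root.put(parts[0], group);
                if (parts.length == 3) {
                    group.put(parts[1], parts[2]); // first entry on the same line
                }
                current = group;
            } else {
                current.put(parts[0], parts[1]); // ordinary key-value pair
            }
        }
        return root;
    }
}
```

For the sample in the question, `parse` should put Key1 and Key2 at the top level and give KeyKey1 and KeyKey2 each a nested map with two entries. Note the limitation: a value that itself contains a colon (a time such as 10:30, say) would be mistaken for a group header, so in a real document you would likely want to use indentation or paragraph styles to detect groups instead.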
Upvotes: 1