mathinvalidnik

Reputation: 1600

How to parse text from MS Word document to string

I am trying to find a way to read the text of a Word document into a string in my project. I have more than 600 Word (.doc) files, and for each one I need to get the text content (with the new lines and tabs, if possible) and assign it to a string.

I've been reading about the Open XML SDK, but it looks quite complicated for something that seems so simple.

Upvotes: 4

Views: 19181

Answers (2)

Vadim

Reputation: 2865

The Open XML SDK only supports the Word 2007 and newer formats (.docx, not .doc), and it is not trivial to use.

If performance is not an issue, you could use Word Automation and have Word do this for you. It will look something like this:

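// Requires a reference to Microsoft.Office.Interop.Word and an installed copy of Word.
using System.Runtime.InteropServices;
using Microsoft.Office.Interop.Word;

var app = new Application();
// Open read-only: we only want the text, not to modify the file.
var doc = app.Documents.Open(documentLocation, ReadOnly: true);

// Text of the main document body; paragraph marks come back as '\r', tabs as '\t'.
string rangeText = doc.Range().Text;

// Close without saving, then shut down the Word process we started.
doc.Close(SaveChanges: false);
app.Quit();

// Release the COM wrappers so WINWORD.EXE does not linger.
Marshal.ReleaseComObject(doc);
Marshal.ReleaseComObject(app);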

Take a look at http://www.codeproject.com/Articles/18703/Word-2007-Automation or http://www.codeproject.com/Articles/21247/Word-Automation for more complete examples and instructions. Note that this may become a bit more tricky if your documents are more complex (footnotes, text boxes, tables...).
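For a batch like the 600 files in the question, starting Word once and reusing it for every document is usually much cheaper than starting it per file. A rough sketch of that idea follows, assuming C# 4 or later with a reference to Microsoft.Office.Interop.Word; the class name DocTextReader and the method ReadAll are made up for illustration, and the error handling is minimal.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper, not part of any library.
static class DocTextReader
{
    // Returns a map from full file path to the text of that document's main body.
    public static Dictionary<string, string> ReadAll(string folder)
    {
        var texts = new Dictionary<string, string>();
        var app = new Word.Application();   // one Word instance for the whole batch
        try
        {
            // Word resolves relative paths against its own working directory,
            // so pass absolute paths only.
            string root = Path.GetFullPath(folder);
            foreach (string path in Directory.GetFiles(root, "*.doc"))
            {
                Word.Document doc = app.Documents.Open(path, ReadOnly: true);
                try
                {
                    texts[path] = doc.Range().Text;
                }
                finally
                {
                    // Always close the document, even if reading the text failed.
                    doc.Close(SaveChanges: false);
                    Marshal.ReleaseComObject(doc);
                }
            }
        }
        finally
        {
            // Always shut Word down, or WINWORD.EXE stays running in the background.
            app.Quit();
            Marshal.ReleaseComObject(app);
        }
        return texts;
    }
}
```

Usage would be along the lines of `var texts = DocTextReader.ReadAll(@"C:\docs");` (the folder is a placeholder). Depending on the runtime, the "*.doc" pattern may also pick up .docx files or Word's temporary lock files, so you may want to filter the list first.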

Another option is to have Word save the document as text and then read the text file. Take a look at this - http://msdn.microsoft.com/en-us/library/microsoft.office.tools.word.document.saveas(v=vs.80).aspx
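The save-as-text route could look roughly like the sketch below. It is an assumption-laden example rather than a tested recipe: it uses the interop Document.SaveAs method instead of the VSTO one in the link, writes the .txt file next to the original, and picks wdFormatUnicodeText so that non-ANSI characters survive. The class name DocToTextFile and the method ReadViaTextFile are hypothetical.

```csharp
using System.IO;
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper, not part of any library.
static class DocToTextFile
{
    public static string ReadViaTextFile(string docPath)
    {
        string fullPath = Path.GetFullPath(docPath);
        // Assumed output location: same folder and name, with a .txt extension.
        string txtPath = Path.ChangeExtension(fullPath, ".txt");

        var app = new Word.Application();
        try
        {
            Word.Document doc = app.Documents.Open(fullPath);
            try
            {
                // Ask Word to write the document out as Unicode plain text.
                doc.SaveAs(FileName: txtPath,
                           FileFormat: Word.WdSaveFormat.wdFormatUnicodeText);
            }
            finally
            {
                doc.Close(SaveChanges: false);
                Marshal.ReleaseComObject(doc);
            }
        }
        finally
        {
            app.Quit();
            Marshal.ReleaseComObject(app);
        }

        // ReadAllText picks the encoding from a byte order mark when one is present.
        return File.ReadAllText(txtPath);
    }
}
```

Compared with reading doc.Range().Text directly, this costs an extra file per document, but it lets Word decide how tables, footnotes and similar elements are flattened into plain text.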

Upvotes: 5

npinti

Reputation: 52185

You could take a look at NPOI:

This project is the .NET version of POI Java project at http://poi.apache.org/. POI is an open source project which can help you read/write xls, doc, ppt files. It has a wide application.

Take a look at this previous SO thread for more information.

Upvotes: 1
