Reputation: 323
I want to extract text from .doc files. I use this code:
Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
object miss = System.Reflection.Missing.Value;
object path = FileToSave_path + FileNameToSave + ".doc";
object readOnly = true;
Microsoft.Office.Interop.Word.Document docs = word.Documents.Open(ref path, ref miss, ref readOnly, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss);
string totaltext = "";
// Word collections are 1-based, so index from 1 to Count.
for (int p = 1; p <= docs.Paragraphs.Count; p++)
{
    // Append each paragraph to the variable declared above.
    totaltext += " \r\n " + docs.Paragraphs[p].Range.Text;
}
docs.Close();
word.Quit();
The problem is that this code is very slow. I have many .doc files with many paragraphs. Is there a faster way to extract text from .doc files?
Upvotes: 1
Views: 4347
Reputation: 22651
It is slow because Word has to be started every time. This happens in the background, but Word still runs its startup routines on each launch. So it helps to close only the document (docs.Close();) after each file and keep Word itself running, calling word.Quit(); just once after the last file.
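As a rough sketch of that idea, the loop below starts Word once, opens each file in turn, and quits only at the end. The class name DocTextExtractor and the method ExtractAll are made up for illustration, and the named arguments assume C# 4 or later. It also reads doc.Content.Text in one call instead of looping over paragraphs, which should cut the number of COM round trips; whether that matters for your files is something to measure.

```csharp
using System.Collections.Generic;
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

static class DocTextExtractor
{
    // Hypothetical helper: returns the text of each .doc file, keyed by its path.
    public static Dictionary<string, string> ExtractAll(IEnumerable<string> paths)
    {
        var results = new Dictionary<string, string>();
        var word = new Word.Application();   // start Word once for all files
        try
        {
            foreach (string path in paths)
            {
                Word.Document doc = word.Documents.Open(
                    path, ReadOnly: true, AddToRecentFiles: false);
                try
                {
                    // One COM call for the whole document body.
                    results[path] = doc.Content.Text;
                }
                finally
                {
                    // Close only the document; Word keeps running.
                    doc.Close(SaveChanges: false);
                }
            }
        }
        finally
        {
            word.Quit();                      // shut Word down once, at the end
            Marshal.ReleaseComObject(word);
        }
        return results;
    }
}
```

Calling DocTextExtractor.ExtractAll(Directory.GetFiles(folder, "*.doc")) would then process a whole folder with a single Word startup (folder being whatever directory holds your files).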
You can also look into third-party libraries that can open .doc files without the help of Word. For .docx files, you can use Microsoft's own Open XML SDK.
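For the .docx case, a minimal sketch with the Open XML SDK might look like the following. It assumes the DocumentFormat.OpenXml package is referenced; the class and method names are placeholders. Note that it does not work on the older binary .doc format.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class DocxTextExtractor
{
    // Hypothetical helper: returns the body text of a .docx file,
    // one line per paragraph. Word does not need to be installed.
    public static string ExtractText(string path)
    {
        // Second argument 'false' opens the package read-only.
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;
            IEnumerable<string> paragraphs =
                body.Descendants<Paragraph>().Select(p => p.InnerText);
            return string.Join("\r\n", paragraphs);
        }
    }
}
```

Since this only reads the XML inside the file and never launches Word, it avoids the startup cost entirely.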
Upvotes: 2