Read .doc files very fast in C#

Question

I want to extract text from .doc files. I use this code:

Microsoft.Office.Interop.Word.Application word = new  Microsoft.Office.Interop.Word.Application();
object miss = System.Reflection.Missing.Value;
object path = FileToSave_path + FileNameToSave + ".doc";
object readOnly = true;
Microsoft.Office.Interop.Word.Document docs = word.Documents.Open(ref path, ref miss, ref readOnly, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss);
string totaltext = "";
for (int p = 0; p < docs.Paragraphs.Count; p++)
{
    ExtractedHTML += " \r\n " + docs.Paragraphs[p + 1].Range.Text.ToString();
}

docs.Close();
word.Quit();

The problem is that this code is very slow. I have many .doc files with many paragraphs. Is there a faster way to extract text from .doc files?

Glorfindel · Accepted Answer

It is slow because your code starts Word for every file. This happens under the hood, but Word still has to run its startup routines each time. So it helps to create the Word application once, close only each document when you are done with it, and call word.Quit(); only after the last file.
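As a rough sketch of that idea, the code below opens one Word instance, loops over all files, and quits Word once at the end. The class name DocTextExtractor and the method ExtractAll are hypothetical names chosen for illustration. Reading doc.Content.Text in a single call instead of looping over Paragraphs is my own substitution, not part of the advice above; it assumes you only need the plain text, with paragraphs separated by the \r characters Word puts in the range text.

```csharp
using System.Collections.Generic;
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

static class DocTextExtractor
{
    // Hypothetical helper: returns the plain text of each file, keyed by path.
    public static Dictionary<string, string> ExtractAll(IEnumerable<string> paths)
    {
        var results = new Dictionary<string, string>();

        // Start Word once. The _Application and _Document interfaces are used
        // so that Quit/Close resolve to the methods rather than the events.
        Word._Application word = new Word.Application();
        try
        {
            foreach (string file in paths)
            {
                Word._Document doc = word.Documents.Open(
                    FileName: file, ReadOnly: true, AddToRecentFiles: false);
                try
                {
                    // One COM call for the whole document body.
                    results[file] = doc.Content.Text;
                }
                finally
                {
                    // Close only the document; Word itself stays open.
                    doc.Close(SaveChanges: false);
                }
            }
        }
        finally
        {
            // Quit Word once, after the last file.
            word.Quit(SaveChanges: false);
            Marshal.ReleaseComObject(word);
        }
        return results;
    }
}
```

This still needs Word installed on the machine, and the named arguments without ref rely on the COM interop support in C# 4 or later.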

You can also look into third-party libraries that can open .doc files without the help of Word. For .docx files, you can use Microsoft's own Open XML SDK.
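For the .docx case, a minimal sketch with the Open XML SDK could look like the following. It assumes the DocumentFormat.OpenXml package is referenced; the class name DocxTextExtractor and the method ExtractText are hypothetical. It does not read the older binary .doc format, so it only helps if the files are, or can be converted to, .docx.

```csharp
using System.Text;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class DocxTextExtractor
{
    // Hypothetical helper: returns the text of a .docx file, one paragraph per line.
    public static string ExtractText(string path)
    {
        var sb = new StringBuilder();

        // Open the package read-only; Word is not started at all.
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;
            foreach (Paragraph paragraph in body.Descendants<Paragraph>())
            {
                // InnerText concatenates the text of all runs in the paragraph.
                sb.AppendLine(paragraph.InnerText);
            }
        }
        return sb.ToString();
    }
}
```

Because this reads the file directly, there is no per-file application startup cost.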
