Reputation: 149

How to translate .doc to string?

Is there a way to convert a Microsoft Word document to a string without using the Microsoft COM component? I am hoping there is some other way to deal with all of the excess markup.

EDIT 12/13/13: We didn't want to reference the COM component because it wouldn't work unless the customer had the exact same version of Office installed. Luckily, Microsoft has made the 2013 word.interop.dll backward compatible, so we no longer have to worry about this restriction. After referencing the DLL, we can do the following:

// Requires: using System; using System.IO; using Microsoft.Office.Interop.Word;
/// <summary>Gets the content of the word document</summary>
/// <param name="filePath">The path to the word document file</param>
/// <returns>The content of the document</returns>
public string ExtractText(string filePath)
{
    if (string.IsNullOrEmpty(filePath))
        throw new ArgumentNullException("filePath", "Input file path not specified.");

    if (!File.Exists(filePath))
        // The second argument is the file name that could not be found, so pass the actual path.
        throw new FileNotFoundException("Input file not found at specified path.", filePath);

    var resultText = string.Empty;
    Application wordApp = null;

    try
    {
        wordApp = new Application();
        var doc = wordApp.Documents.Open(filePath, Type.Missing, true);
        if (doc != null)
        {
            if (doc.Content != null && !string.IsNullOrEmpty(doc.Content.Text))
                resultText = doc.Content.Text.Normalize();

            doc.Close();
        }
    }
    finally
    {
        if (wordApp != null)
            wordApp.Quit(false, Type.Missing, false);
    }

    return resultText;
}

Upvotes: 4

Answers (3)

Mario Z

Reputation: 4381

If you are referring to the older DOC file format, that is quite an issue, because it is a Microsoft-specific binary file format, and I must say I totally agree with RQDQ's comment.

But if you are referring to the DOCX file format, you can achieve this without the MS COM component or any other component, using just pure .NET.

Check the following solutions:

http://www.codeproject.com/Articles/20529/Using-DocxToText-to-Extract-Text-from-DOCX-Files
http://www.dotnetspark.com/kb/Content.aspx?id=5633
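
To illustrate the pure .NET idea: a DOCX file is a ZIP archive whose main text lives in word/document.xml, so the text can be pulled out with the framework's own ZIP and XML classes. The following is only a minimal sketch, not the code from the linked articles. The class name DocxText and the method Extract are made up for illustration, and it assumes .NET 4.5 or later (with a reference to System.IO.Compression.FileSystem on the full framework). It ignores headers, footers, footnotes, tabs and line breaks.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxText
{
    // Main WordprocessingML namespace used inside word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: returns the body text, one paragraph per line.
    public static string Extract(string path)
    {
        using (var zip = ZipFile.OpenRead(path))
        {
            var entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; not a DOCX file?");

            using (var stream = entry.Open())
            {
                var xml = XDocument.Load(stream);

                // Each w:p is a paragraph; its visible text sits in w:t elements.
                var paragraphs = xml.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));

                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }
}
```

Calling DocxText.Extract(filePath) returns the paragraphs joined by line breaks. Documents with paragraphs nested inside other paragraphs (text boxes, for example) may yield repeated text with this simple approach.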

Upvotes: 0

Yahia

Reputation: 70379

You will need to use some library to achieve what you want:

MS provides the OpenXML SDK V 2.0 (free, DOCX only)
Aspose.Words (commercial, DOC and DOCX)

If you have lots of time on your hands, then writing a .DOC parser might be feasible - the .DOC spec can be found here.

BTW: Office Interop is not supported by MS in server-like scenarios (such as ASP.NET, a Windows Service or similar) - see http://support.microsoft.com/default.aspx?scid=kb;EN-US;q257757#kb2
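
For the DOCX case, the OpenXML SDK route could look roughly like the sketch below. It assumes the DocumentFormat.OpenXml package is referenced. The class name OpenXmlText and the method Extract are hypothetical names chosen for this example, and only the main document body is read (no headers, footers or footnotes).

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class OpenXmlText
{
    // Hypothetical helper: returns the body text, one paragraph per line.
    public static string Extract(string path)
    {
        // Second argument false = open the package read-only.
        using (var doc = WordprocessingDocument.Open(path, false))
        {
            var body = doc.MainDocumentPart?.Document?.Body;
            if (body == null)
                return string.Empty;

            // InnerText concatenates the text of all runs inside a paragraph.
            var paragraphs = body.Descendants<Paragraph>().Select(p => p.InnerText);
            return string.Join(Environment.NewLine, paragraphs);
        }
    }
}
```

Because this reads the file directly and never starts Word, it avoids the server-side Interop restriction mentioned above, but it does not help with the binary DOC format.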

Upvotes: 2

Olaf

Reputation: 10247

Assuming you mean to extract the text content of a doc file, there are a few command line tools as well as commercial libraries. A rather old tool that we once used to search doc (not docx) files (in combination with the search engine sphider) was catdoc (also here). It is a DOS rather than a Windows tool, but it nonetheless worked for us as long as we met the prerequisites (file names in 8.3 format).

A commercial alternative is doc2txt, if you can afford $29.

For the newer docx format, you can use the Perl based tool docx2txt.

Of course, if you want to run those tools from C#, you need to start an external Process - check here for a solid explanation.
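
Starting such a tool from C# and capturing what it prints could be sketched as follows. The class name ExternalConverter and the method Run are made up for this example, and the sketch assumes the tool writes the extracted text to standard output (catdoc does this by default, as far as I know; other tools may need an output file argument instead).

```csharp
using System;
using System.Diagnostics;

public static class ExternalConverter
{
    // Hypothetical helper: runs a command line tool and returns its standard output.
    public static string Run(string exePath, string arguments)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = arguments,
            UseShellExecute = false,        // required for stream redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Read before WaitForExit so a full output pipe cannot block the child process.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            if (process.ExitCode != 0)
                throw new InvalidOperationException("Converter exited with code " + process.ExitCode);

            return output;
        }
    }
}
```

A call might look like ExternalConverter.Run("catdoc.exe", "REPORT.DOC"), where both the executable location and the file name are placeholders.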

A rather expensive but very powerful tool for accessing doc and docx content is Spire.Doc, though it does a lot more than you need. As a .NET library, it is more convenient to use than the external tools above.

Upvotes: 1
