Shanthini

Reputation: 197

How to convert a Word document to a text file in C# without using Microsoft.Office.Interop?

I have many Word documents, saved in different versions of Word, that have to be converted to text files.
This link may point in the right direction: How to extract text from Word files using C#? I want to read the content of each Word document and remove all formatting, so that only the plain text ends up in the text file. I have done this with Microsoft.Office.Interop, but that always starts an instance of Word on the client, which is not recommended. So I am trying to create a C# project that converts Word documents to text automatically. Can anyone suggest a third-party tool, either open source or reasonably priced, that efficiently converts all versions of Word documents to text files in C#?

With Regards, Shanthini

Upvotes: 1

Views: 5464

Answers (2)

elshev

Reputation: 1563

If you don't want Interop, you can use NPOI. It's a mature open source project for working with Word and Excel files.

Please note that a Word file can have a complex structure, such as nested tables or merged/split cells. That is probably why NPOI doesn't have an explicit SaveAsText() method. But if you only need the text from paragraphs and tables, you can easily extract it like this (.NET 6 example):

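using System.Collections.Generic;
using System.IO;
using System.Text;
using NPOI.XWPF.UserModel;

// Opens a .docx file and returns its text, one entry per paragraph or table row.
public static IEnumerable<string> WordFileToText(string wordFilePath)
{
    using var fileStream = File.OpenRead(wordFilePath);
    using var doc = new XWPFDocument(fileStream);
    return WordFileToText(doc);
}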

private static IEnumerable<string> WordFileToText(XWPFDocument doc)
{
    var result = new List<string>();
    foreach (var bodyElement in doc.BodyElements)
    {
        if (bodyElement is XWPFParagraph paragraph)
        {
            result.Add(paragraph.Text);
            continue;
        }
        // Skip anything that is neither a paragraph nor a table.
        if (bodyElement is not XWPFTable table)
            continue;

        foreach (var row in table.Rows)
        {
            var tableLine = new StringBuilder();
            foreach (var cell in row.GetTableCells())
            {
                // A cell can hold several paragraphs; each one is followed by a "| " separator.
                foreach (var cellParagraph in cell.Paragraphs)
                {
                    tableLine.Append(cellParagraph.Text);
                    tableLine.Append("| ");
                }
            }
            result.Add(tableLine.ToString());
        }
    }
    return result;
}
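
Since the question is about turning many documents into text files, here is a minimal sketch of a batch step built around the method above. The class name WordToTextBatch, the method ConvertFolder and its parameters are hypothetical names of mine, not part of NPOI. The extractor is passed in as a delegate, so the sketch itself only depends on System.IO:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class WordToTextBatch
{
    // Hypothetical helper (not part of NPOI): writes one .txt file into
    // outputDir for every .docx file found in inputDir.
    // 'extract' is any function that returns the text lines of one document,
    // for example the WordFileToText(string) method shown above.
    // Returns the number of files converted.
    public static int ConvertFolder(
        string inputDir,
        string outputDir,
        Func<string, IEnumerable<string>> extract)
    {
        Directory.CreateDirectory(outputDir);
        var converted = 0;
        foreach (var wordFile in Directory.EnumerateFiles(inputDir, "*.docx"))
        {
            // Word keeps "~$..." lock files next to open documents; skip them.
            if (Path.GetFileName(wordFile).StartsWith("~$", StringComparison.Ordinal))
                continue;

            var textFile = Path.Combine(
                outputDir,
                Path.GetFileNameWithoutExtension(wordFile) + ".txt");
            File.WriteAllLines(textFile, extract(wordFile), Encoding.UTF8);
            converted++;
        }
        return converted;
    }
}
```

Assuming the methods above live in the calling class, a call could look like WordToTextBatch.ConvertFolder(@"C:\docs", @"C:\docs\txt", WordFileToText); both paths are placeholders. Only *.docx files are picked up here, because XWPFDocument reads the Office Open XML format.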

Upvotes: 0

Shanthini

Reputation: 197

I finally found a solution that works perfectly for me at the moment, although I haven't tested it with 10000 documents. Here you go: http://sourceforge.net/projects/word-reader/?source=dlp Comments and suggestions about this solution are welcome.

Thank you, Shanthini

Upvotes: 1
