Reputation: 525

How to find highlighted text from Word file in C# using Microsoft.Office.Interop.Word?

This would be a simple question except for one extra requirement that has turned into a big headache: I do not need the individual highlighted "words" but the highlighted "phrases" from the Word file. I have written the following code:

using Word = Microsoft.Office.Interop.Word;

private void button1_Click(object sender, EventArgs e)
{
    try
    {
        Word.ApplicationClass wordObject = new Word.ApplicationClass();
        wordObject.Visible = false;
        object file = "D:\\mywordfile.docx";
        object nullobject = System.Reflection.Missing.Value;
        Word.Document thisDoc = wordObject.Documents.Open(ref file, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject);
        List<string> wordHighlights = new List<string>();

        //Let myRange be some Range which has my text under consideration

        int prevStart = 0;
        int prevEnd = 0;
        int thisStart = 0;
        int thisEnd = 0;
        string tempStr = "";
        foreach (Word.Range cellWordRange in myRange.Words)
        {
            if (cellWordRange.HighlightColorIndex == Word.WdColorIndex.wdNoHighlight)   // compare the enum value directly instead of its string name
            {
                continue;
            }
            else
            {
                thisStart = cellWordRange.Start;
                thisEnd = cellWordRange.End;
                string cellWordText = cellWordRange.Text.Trim();
                if (cellWordText.Length >= 1)   // valid word length, non-whitespace
                {
                    if (thisStart == prevEnd)    // If this word is contiguously highlighted with previous highlighted word
                    {
                        tempStr = String.Concat(tempStr, " "+cellWordText);  // Concatenate with previous contiguously highlighted word
                    }
                    else
                    {
                        if (tempStr.Length > 0)    // If some string has been concatenated in previous iterations
                        {
                            wordHighlights.Add(tempStr);
                        }
                        tempStr = cellWordText;    // Start a new phrase with this word
                    }
                }
                prevStart = thisStart;
                prevEnd = thisEnd;
            }
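        }
        if (tempStr.Length > 0)    // Flush the last phrase, which the loop above never adds
        {
            wordHighlights.Add(tempStr);
        }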

        foreach (string highlightedString in wordHighlights)
        {
            MessageBox.Show(highlightedString);
        }
    }
    catch (Exception j)
    {
        MessageBox.Show(j.Message);
    }
}

Now consider the following text:

Le thé vert a un rôle dans la diminution du cholestérol, la combustion des graisses, la prévention du diabète et les AVC, et conjurer la démence.

Now suppose someone highlighted "du cholestérol". My code obviously picks up two words, "du" and "cholestérol". How can I make a continuously highlighted area come back as a single entry? I mean that "du cholestérol" should be returned as one entity in the List. Is there some logic by which we scan the document character by character, mark the point where highlighting starts as the start of the selection, and the point where it ends as the end of the selection?

P.S.: If there is a library with the required capabilities in any other language, please let me know, as the scenario is not language-specific. I only need to get the desired results somehow.

EDIT: I modified the code to use Start and End as suggested by Oliver Hanappi. The remaining problem is that if there are two such highlighted phrases separated only by a white space, the program treats both phrases as one, simply because it reads the Words and not the spaces. Maybe some changes are required around if (thisStart == prevEnd)?
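
A minimal sketch of the character-by-character idea, independent of Word: it assumes the text and one highlight flag per character have already been extracted (for example by walking myRange.Characters and testing each character's HighlightColorIndex, which is likely to be slow on large documents). HighlightRuns and Extract are hypothetical names, not part of the interop library. Because the space between two phrases carries its own flag, an unhighlighted space ends the current phrase:

```csharp
using System;
using System.Collections.Generic;

public static class HighlightRuns
{
    // Hypothetical helper: text[i] is the i-th character of the range and
    // highlighted[i] says whether that character is highlighted.
    public static List<string> Extract(string text, IList<bool> highlighted)
    {
        if (text.Length != highlighted.Count)
            throw new ArgumentException("text and highlighted must have the same length");

        var phrases = new List<string>();
        int runStart = -1;                       // -1 means "not inside a highlighted run"

        // Iterate one position past the end so a run reaching the last character is closed too.
        for (int i = 0; i <= text.Length; i++)
        {
            bool on = i < text.Length && highlighted[i];
            if (on && runStart < 0)
            {
                runStart = i;                    // a highlighted run begins here
            }
            else if (!on && runStart >= 0)
            {
                string phrase = text.Substring(runStart, i - runStart).Trim();
                if (phrase.Length > 0) phrases.Add(phrase);
                runStart = -1;                   // the run has ended
            }
        }
        return phrases;
    }
}
```

With the space between "du" and "cholestérol" flagged as highlighted, the two words come back as one phrase; with that space unflagged, they come back as two separate phrases.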

Upvotes: 3

Answers (4)

tinamou

Reputation: 2291

grahamj42's answer is fine; I've translated it to C#. If you want to find matches in the whole document, use:

Word.Range content = thisDoc.Content;

But remember that this is only the main story range. If you want to match words in, for example, footnotes, you need to use:

Word.StoryRanges stories = null;
stories = thisDoc.StoryRanges;
Word.Range footnoteRange = stories[Word.WdStoryType.wdFootnotesStory];

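My code (range is the Range to search, for example content above; Marshal comes from System.Runtime.InteropServices):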

Word.Find find = null;
Word.Range duplicate = null;
try
{
    duplicate = range.Duplicate;
    find = duplicate.Find;
    find.Highlight = 1;

    object str = "";
    object missing = System.Type.Missing;
    object objTrue = true;
    object replace = Word.WdReplace.wdReplaceNone;

    bool result = find.Execute(ref str, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref objTrue, ref str, ref replace, ref missing, ref missing, ref missing, ref missing);
    while (result)
    {
        // code to store range text
        // use duplicate.Text property
        result = find.Execute(ref str, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref objTrue, ref str, ref replace, ref missing, ref missing, ref missing, ref missing);
    }
}
finally
{
    if (find != null) Marshal.ReleaseComObject(find);
    if (duplicate != null) Marshal.ReleaseComObject(duplicate);
}

Upvotes: 0

Zeeshan

Reputation: 525

I started with Oliver's logic and things seemed to be fine, but testing revealed that this method does not take white space into account, so highlighted phrases separated by just a space were not being split. I then used the VB approach provided by grahamj42, added it as a class library, and referenced it from my C# Windows Forms project.

My C# Windows Forms project:

using Word = Microsoft.Office.Interop.Word;

and then I changed the try block to:

Word.ApplicationClass wordObject = new Word.ApplicationClass();
wordObject.Visible = false;
object file = "D:\\mywordfile.docx";
object nullobject = System.Reflection.Missing.Value;
Word.Document thisDoc = wordObject.Documents.Open(ref file, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject, ref nullobject);

List<string> wordHighlights = new List<string>();


// Let myRange be some Range which has already been selected programmatically here


WordMacroClasses.Highlighting macroObj = new WordMacroClasses.Highlighting();
List<string> hiWords = macroObj.HighlightRange(myRange, myRange.End);
foreach (string hitext in hiWords)
{
    wordHighlights.Add(hitext);
}

And here is the Range.Find code in the VB class library, which simply accepts the Range and its end position (Range.End) and returns a List(Of String):

Public Class Highlighting
    Public Function HighlightRange(ByVal myRange As Microsoft.Office.Interop.Word.Range, ByVal rangeLimit As Integer) As List(Of String)

        Dim Highlights As New List(Of String)

        With myRange.Find
            .Highlight = True
            Do While .Execute = True     ' loop while highlighted text is found

                If (myRange.Start < rangeLimit) Then Highlights.Add(myRange.Text)

            Loop
        End With
        Return Highlights
    End Function
End Class

Upvotes: -1

grahamj42

Reputation: 2762

You can do this far more efficiently with Find, which searches more quickly and selects all the contiguous text that matches. See the reference here: http://msdn.microsoft.com/en-us/library/office/bb258967%28v=office.12%29.aspx

Here is an example in VBA which prints all occurrences of highlighted text:

Sub TestFind()

  Dim myRange As Range

  Set myRange = ActiveDocument.Content    '    search entire document

  With myRange.Find

    .Highlight = True

    Do While .Execute = True     '   loop while highlighted text is found

      Debug.Print myRange.Text   '   myRange is changed to contain the found text

    Loop

  End With

End Sub

Hope this helps you understand better.

Upvotes: 2

Oliver Hanappi

Reputation: 12346

You can look at the Start and End properties of the ranges and check whether the end of the first range equals the start of the second.

As an alternative, you may move the range by one word (see WdUnits.wdWord) and then check whether the moved start and end equal the start and end of the second word.

Upvotes: 1
