RonyK

Reputation: 2674

How to split a Word document by specific text using C# and the Open XML SDK?

I want to split a Word document in two at a specific text, programmatically, using C# and the Open XML SDK. For the first part, I removed all paragraphs up to the paragraph containing the desired text. This worked fine. Then, on a copy of the original document, I did the same, only this time removing all paragraphs starting from the one containing the desired text. For some reason the second part turned out to be an invalid document that can't be opened in Word. Opening the corrupted document with the "Open XML SDK 2.0 Productivity Tool" and validating it doesn't detect any problems with the document.

This is the code removing the part before the desired text (works fine):

    public static void DeleteFirstPart(string docName)
    {
        using (WordprocessingDocument document = WordprocessingDocument.Open(docName, true))
        {
            DocumentFormat.OpenXml.Wordprocessing.Document doc = document.MainDocumentPart.Document;

            List<Text> textparts = document.MainDocumentPart.Document.Body.Descendants<DocumentFormat.OpenXml.Wordprocessing.Text>().ToList();
            foreach (Text textfield in textparts)
            {
                if (!textfield.Text.Contains("split here"))
                {
                    RemoveItem1(textfield);
                }
                else
                {
                    break;
                }
            }
        }
    }

I tried two different remove-item methods, both with the same result:

    private static void RemoveItem1(Text item)
    {
        // Go up two levels (Text -> Run -> Paragraph in the simplest case) to reach the paragraph.
        if ((item.Parent != null) &&
          (item.Parent.Parent != null) &&
          (item.Parent.Parent.Parent != null))
        {
            var topNode = item.Parent.Parent;
            var topParentNode = item.Parent.Parent.Parent;
            if (topParentNode != null)
            {
                topNode.Remove();
                // No more children? Remove the parent node, as well.
                if (!topParentNode.HasChildren)
                {
                    topParentNode.Remove();
                }
            }
        }
    }


    private static void RemoveItem2(Text textfield)
    {
        if (textfield.Parent != null)
        {
            if (textfield.Parent.Parent != null)
            {
                if (textfield.Parent.Parent.Parent != null)
                {
                    textfield.Parent.Parent.Remove();
                }
                else
                {
                    textfield.Parent.Remove();
                }
            }
            else
            {
                textfield.Remove();
            }
        }   
    }

This is the code removing the part starting from the desired text (corrupts the document):

    public static void DeleteSecondPart(string docName)
    {
        using (WordprocessingDocument document = WordprocessingDocument.Open(docName, true))
        {
            DocumentFormat.OpenXml.Wordprocessing.Document doc = document.MainDocumentPart.Document;

            List<Text> textparts = document.MainDocumentPart.Document.Body.Descendants<DocumentFormat.OpenXml.Wordprocessing.Text>().ToList();
            bool remove = false;
            foreach (Text textfield in textparts)
            {
                if (textfield.Text.Contains("split here"))
                {
                    remove = true;
                }

                if (remove)
                {
                    RemoveItem1(textfield);
                    // Using textfield.Remove() (below) instead of the line above removes only
                    // the text element itself. That works and the document stays valid, but it
                    // leaves empty paragraphs that can run for pages.
                    //textfield.Remove();
                }
            }
        }
    }

Upvotes: 2

Views: 2879

Answers (1)

RonyK

Reputation: 2674

A rewrite of the RemoveItem method did the trick:

    private static void RemoveItem3(Text textfield)
    {
        OpenXmlElement element = textfield;
        while (!(element.Parent is DocumentFormat.OpenXml.Wordprocessing.Body) && element.Parent != null)
        {
            element = element.Parent;
        }

        if (element.Parent != null)
        {
            element.Remove();
        }
    }
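
The post does not say why the earlier methods corrupted the file, so the following is only a plausible explanation, not a confirmed diagnosis. If some of the text sits inside a table, two levels above the Text is a paragraph inside a table cell, and removing that paragraph leaves a cell with no paragraph, which Word generally refuses to open. Climbing to the direct child of Body removes the whole table instead. A minimal sketch on an in-memory tree (the class name, the TopLevelAncestor helper, and the sample content are hypothetical, not from the original code):

```csharp
using System;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class RemoveItemSketch
{
    // Same idea as RemoveItem3: climb until the element is a direct child of Body.
    // Returns null when the subtree is already detached from the body.
    public static OpenXmlElement TopLevelAncestor(Text text)
    {
        OpenXmlElement element = text;
        while (element.Parent != null && !(element.Parent is Body))
        {
            element = element.Parent;
        }
        return element.Parent != null ? element : null;
    }

    public static void Demo()
    {
        // Hypothetical body: one plain paragraph, then a table with a single cell.
        var cellText = new Text("split here");
        var body = new Body(
            new Paragraph(new Run(new Text("intro"))),
            new Table(new TableRow(new TableCell(
                new Paragraph(new Run(cellText))))));

        // Two levels up from the Text is the paragraph inside the cell ...
        Console.WriteLine(cellText.Parent.Parent.GetType().Name);        // Paragraph
        // ... and its parent is the TableCell, not the Body.
        Console.WriteLine(cellText.Parent.Parent.Parent.GetType().Name); // TableCell

        // Climbing to the Body's direct child selects the whole table.
        OpenXmlElement top = TopLevelAncestor(cellText);
        Console.WriteLine(top.GetType().Name);                           // Table
        top.Remove();
        Console.WriteLine(body.ChildElements.Count);                     // 1
    }
}
```

The null return also covers Text elements whose top-level ancestor was already removed by an earlier iteration: their parent chain ends at the detached element, so nothing is removed twice.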

Upvotes: 2
