MZB

Reputation: 2111

Walking a WebBrowser control DOM - elements with both children and text

I'm trying to walk the DOM of a WebBrowser control using C# and perform some processing on each HtmlElement. (I'm doing some transformations on the DOM at the same time, but for this discussion assume that I am trying to flatten the DOM by walking each node recursively.)

When I encounter something like:

<p>Text with a <a href="http://www.example.com/">link</a> in the middle of it </p>

I find an HtmlElement for the P tag (which contains the expected InnerText) and a child HtmlElement node corresponding to the tag A. The HtmlElement for the A tag contains the expected inner text.

But I cannot find any structures or attributes related just to the text before and after the A tag.

Is there a way to find the text before and after the text of the A tag other than the dreadful hack of comparing the InnerHtml property of the P tag with the OuterHtml property of the A tag?

Or is there another way to walk the IE DOM?

Upvotes: 1

Views: 1044

Answers (1)

Sheng Jiang 蒋晟

Reputation: 15281

To get text nodes in the DOM, QI (QueryInterface, which is just a type cast in C#) the parent element's underlying COM object (HtmlElement.DomElement in Windows Forms) for mshtml.IHTMLDOMNode.

Then you can get the direct child nodes via IHTMLDOMNode.childNodes. Enumerate that collection and look for nodes whose nodeType is 3 (text). If you want text nodes inside child elements as well, repeat this recursively for child nodes of type 1 (element).
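The steps above could be sketched roughly as follows. This is a minimal, untested sketch that assumes a Windows Forms project with a reference to the Microsoft.mshtml interop assembly; the class name DomTextWalker and the method names CollectText and Walk are made up for illustration and are not part of any library.

```csharp
using System.Collections.Generic;
using System.Windows.Forms;
using mshtml; // assumes a project reference to the Microsoft.mshtml interop assembly

// Hypothetical helper, named for illustration only.
static class DomTextWalker
{
    const int ElementNode = 1; // nodeType value for element nodes
    const int TextNode = 3;    // nodeType value for text nodes

    // Collects the value of every text node under 'element', in document order.
    public static List<string> CollectText(HtmlElement element)
    {
        var result = new List<string>();

        // The cast is what performs the COM QueryInterface for IHTMLDOMNode.
        var root = element.DomElement as IHTMLDOMNode;
        if (root != null)
        {
            Walk(root, result);
        }
        return result;
    }

    static void Walk(IHTMLDOMNode node, List<string> result)
    {
        // childNodes is exposed as object in the interop assembly, so cast it.
        var children = (IHTMLDOMChildrenCollection)node.childNodes;

        for (int i = 0; i < children.length; i++)
        {
            var child = (IHTMLDOMNode)children.item(i);

            if (child.nodeType == TextNode)
            {
                // For text nodes, nodeValue holds the text itself.
                var text = child.nodeValue as string;
                if (text != null)
                {
                    result.Add(text);
                }
            }
            else if (child.nodeType == ElementNode)
            {
                // Recurse into child elements to reach their text nodes.
                Walk(child, result);
            }
        }
    }
}
```

For the P element in the question, a call such as DomTextWalker.CollectText(pElement) should return separate entries for the text before the link, the link text, and the text after it, although the exact whitespace may differ from the source markup depending on how IE parsed it.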

Upvotes: 1
