Reputation: 592
string url = "http://www.myurl.xxx";
HtmlWeb webGet = new HtmlWeb();
HtmlDocument doc = webGet.Load(url);
foreach (var script in doc.DocumentNode.Descendants("script").ToArray())
    script.Remove();
foreach (var style in doc.DocumentNode.Descendants("style").ToArray())
    style.Remove();
string mtext = doc.DocumentNode.InnerText;
The string mtext has no spacing between the pieces of text where the tags were removed. How can I remove the tags and replace every removed tag with a line break or " "?
Upvotes: 0
Views: 5302
Reputation: 32323
You're just removing the nodes. Instead, you should replace each of them with a new node. This will replace your <script> and <style> nodes with a single space:
var nodes = doc.DocumentNode.SelectNodes("//script|//style");
if (nodes != null) // SelectNodes returns null when nothing matches
{
    foreach (var node in nodes.ToArray())
    {
        var replacement = doc.CreateTextNode(" ");
        node.ParentNode.ReplaceChild(replacement, node);
    }
}
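For completeness, here is a minimal end-to-end sketch of the same replacement applied to an in-memory HTML string and followed by the InnerText read from the question. The class name TextExtractor, the method name ExtractText, and the use of LoadHtml instead of HtmlWeb.Load are assumptions made for illustration; they are not part of the original code. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper, named here only for illustration.
public static class TextExtractor
{
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of fetching a URL

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//script|//style");
        if (nodes != null)
        {
            // Copy to an array first so the tree can be modified while iterating.
            foreach (var node in nodes.ToArray())
            {
                // Swap each <script>/<style> node for a single-space text node.
                var replacement = doc.CreateTextNode(" ");
                node.ParentNode.ReplaceChild(replacement, node);
            }
        }

        // Decode entities such as &amp; in the remaining text.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

Calling TextExtractor.ExtractText("<p>Hello</p><script>var x = 1;</script><p>World</p>") should return the two words with a space between them and none of the script source. Swapping " " for "\n" in CreateTextNode gives a line break instead.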
Upvotes: 5