Reputation: 8531
I need to get all the content inside the body tag of an HTML file using C#. Are there any good and effective ways of doing this?
Upvotes: 2
Views: 15204
Reputation: 3367
To save you the math in the accepted answer:
var start = html.IndexOf("<body>") + "<body>".Length;
var end = html.IndexOf("</body>");
var result = html.Substring(start, end - start);
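Mind that it's not 100% bulletproof. It only matches the exact string "<body>", so it finds the first form of the tag below but not the second, which carries an attribute: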
<body>
<body lang="en">
So all in all you are probably better off with the HTML Agility Pack, unless you know for sure which HTML you are working with.
Upvotes: 0
Reputation: 813
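Read the HTML file into a string and get the body tag content using C#, without the HTML Agility Pack:
// Requires: using System.IO; using System.Text.RegularExpressions;
private void Button_Click(object sender, RoutedEventArgs e)
{
    string filepath = @"C:\Users\Testing\Documents\sample1.txt";
    string htmlString = File.ReadAllText(filepath);
    string htmlTagPattern = "<.*?>";
    // Singleline lets '.' match newlines, so a body spanning several lines is captured.
    Regex oRegex = new Regex("<body.*?>(.*?)</body>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    // Keep only the captured body content (empty string if there is no body tag).
    Match bodyMatch = oRegex.Match(htmlString);
    htmlString = bodyMatch.Success ? bodyMatch.Groups[1].Value : string.Empty;
    // Strip the remaining tags, then lines that contain only whitespace.
    htmlString = Regex.Replace(htmlString, htmlTagPattern, string.Empty, RegexOptions.Singleline);
    htmlString = Regex.Replace(htmlString, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);
    // Assumption: the intent here was to drop &nbsp; entities, not every space.
    htmlString = htmlString.Replace("&nbsp;", string.Empty);
}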
Upvotes: 1
Reputation: 754468
Check out the HTML Agility Pack for all sorts of HTML manipulation.
It gives you an interface somewhat similar to the XmlDocument XML handling interface:
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
HtmlNode bodyNode = doc.DocumentNode.SelectSingleNode("/html/body");
if (bodyNode != null)
{
    // InnerHtml holds the markup between the opening and closing body tags
    string bodyContent = bodyNode.InnerHtml;
}
Upvotes: 9
Reputation: 5914
If the file is well-formed XML (XHTML), use the XML classes with XPath. For more advanced HTML manipulation, use the HTML Agility Pack.
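A minimal sketch of the XML route, assuming the input really is well-formed XML. The class and method names (`XhtmlBody`, `GetBodyInnerXml`) are made up for illustration. `XmlDocument.LoadXml` throws an `XmlException` on typical real-world HTML (unclosed tags, or entities such as `&nbsp;` that XML does not define), so this only fits files you know are clean XHTML.

```csharp
using System.Xml;

public static class XhtmlBody
{
    // Returns the markup inside <body>, or null if there is no body element.
    // Hypothetical helper; throws XmlException if the input is not well-formed XML.
    public static string GetBodyInnerXml(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // local-name() matches the elements whether or not the document
        // declares the XHTML namespace on its root element.
        XmlNode body = doc.SelectSingleNode(
            "/*[local-name()='html']/*[local-name()='body']");

        return body?.InnerXml;
    }
}
```

For example, `XhtmlBody.GetBodyInnerXml("<html><body><p>Hi</p></body></html>")` gives back `<p>Hi</p>`.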
Upvotes: 0
Reputation: 29160
It's easy enough to pull the page code into a string, search for the occurrence of the string "<body" and the string "</body", and do a little math to get your value...
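The search-and-subtract idea could be sketched roughly as follows. The names `BodyExtractor` and `GetBodyContent` are hypothetical. Searching for "<body" rather than "<body>" lets the opening tag carry attributes, and the comparison ignores case. It is still plain string matching, so it assumes there is no ">" inside an attribute value and no earlier text that happens to start with "<body".

```csharp
using System;

public static class BodyExtractor
{
    // Returns the text between the opening <body ...> tag and </body>,
    // or null if either tag cannot be found.
    public static string GetBodyContent(string html)
    {
        // Locate the start of the opening tag ("<BODY" matches as well).
        int open = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        if (open < 0) return null;

        // Skip any attributes: the content starts after the next '>'.
        int start = html.IndexOf('>', open);
        if (start < 0) return null;
        start++;

        // The closing tag marks the end of the content.
        int end = html.IndexOf("</body", start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        return html.Substring(start, end - start);
    }
}
```

Called as `BodyExtractor.GetBodyContent(File.ReadAllText(path))`, it returns null instead of throwing when a tag is missing, which leaves the caller to decide how to handle malformed input.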
Upvotes: 2
Reputation: 1038780
You may take a look at SgmlReader and HTML Agility Pack.
Upvotes: 3