Reputation: 81
I've written a grabber for the IMDb website and now I need to parse the pages. I'm going to do it with HtmlAgilityPack.
For example, I've downloaded this page: link to IMDb
and saved it as @"D:\IMDb.htm". From this page I need to extract the line that states how useful the review was, e.g. "1770 out of 2062 people found the following review useful:" from the first review.
My code is below. I believe the XPath is correct, but my node ends up being null:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using HtmlAgilityPack;
static void Main(string[] args)
{
var doc = new HtmlDocument();
doc.LoadHtml("D:\\IMDb.htm");
Console.WriteLine("res", GetDescription("D:\\IMDb.htm"));
Console.ReadLine();
}
public static string GetDescription(string html)
{
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.OptionFixNestedTags = true;
doc.Load(new StringReader(html));
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='tn15content']/div[1]/small[1]");
return node.InnerHtml;
}
I hope you can help, because I don't understand what's wrong.
Upvotes: 1
Views: 924
Reputation: 89285
You shouldn't use StringReader here, because the html variable contains the path to the HTML file to be loaded rather than the HTML markup itself:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.OptionFixNestedTags = true;
doc.Load(html);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='tn15content']/div[1]/small[1]");
return node.InnerHtml;
Even if html contained the markup, you could use HAP's built-in function doc.LoadHtml(html) instead.
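To make the difference between the two loading paths concrete, here is a minimal sketch rather than a definitive implementation. The class and method names (ReviewParser, GetDescriptionFromFile, GetDescriptionFromMarkup, ExtractUsefulness) are hypothetical and only for illustration; the XPath is the one from the question and is assumed to match the saved page. The sketch also returns null instead of throwing when the XPath matches nothing, since SelectSingleNode returns null in that case.

```csharp
using System;
using HtmlAgilityPack;

static class ReviewParser
{
    // Use Load when the argument is a path to a file on disk.
    public static string GetDescriptionFromFile(string path)
    {
        var doc = new HtmlDocument();
        doc.OptionFixNestedTags = true;
        doc.Load(path);
        return ExtractUsefulness(doc);
    }

    // Use LoadHtml when the argument is the HTML markup itself.
    public static string GetDescriptionFromMarkup(string html)
    {
        var doc = new HtmlDocument();
        doc.OptionFixNestedTags = true;
        doc.LoadHtml(html);
        return ExtractUsefulness(doc);
    }

    // Shared lookup; yields null when no node matches the XPath,
    // so the caller can check instead of hitting a NullReferenceException.
    private static string ExtractUsefulness(HtmlDocument doc)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//*[@id='tn15content']/div[1]/small[1]");
        return node?.InnerHtml;
    }
}
```

In Main you would then call something like Console.WriteLine("res: {0}", ReviewParser.GetDescriptionFromFile("D:\\IMDb.htm")). The {0} placeholder matters: with the format string "res" alone, only the text res is printed and the argument is ignored.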
Upvotes: 1