HTMLAgilityPack selects nodes from first iteration through divs

Question

I'm using HtmlAgilityPack to parse a website for the first time. Everything works as expected, but only for the first iteration. Each iteration gives me a unique div with its own data, yet SelectNodes() always returns data from the first div. The code below shows the problem.

All of the station's properties get their values from the first iteration.

    static void Main(string[] args)
    {
        List<Station> stations = new List<Station>();

        wClient = new WebClient();
        wClient.Proxy = null;
        wClient.Encoding = encode;

        for (int i = 1; i <= 1; i++)
        {
            HtmlDocument html = new HtmlDocument();
            string link = string.Format("http://energybase.ru/powerPlant/index?PowerPlant_page={0}&pageSize=20&q=/powerPlant", i);
            html.LoadHtml(wClient.DownloadString(link));
            var stationList = html.DocumentNode.SelectNodes("//div[@class='items']").First().ChildNodes.Where(x=>x.Name=="div").ToList();//get list of nodes with PowerStation Data
            foreach (var item in stationList) //each iteration returns Item with unique InnerHTML
            {
                Station st = new Station();

                st.Name = item.SelectNodes("//div[@class='col-md-20']").First().SelectNodes("//div[@class='name']").First().ChildNodes["a"].InnerText;//gets name from first iteration
                st.Url = item.SelectNodes("//div[@class='col-md-20']").First().SelectNodes("//div[@class='name']").First().ChildNodes["a"].Attributes["href"].Value;//gets url from first iteration and so on
                st.Company = item.SelectNodes("//div[@class='col-md-20']").First().SelectNodes("//div[@class='name']").First().ChildNodes["small"].ChildNodes["em"].ChildNodes["a"].InnerText;

                stations.Add(st);
            }

        }
    }

Maybe I'm missing some of the essentials of OOP?

Alexander Petrov · Accepted Answer

Your code can be greatly simplified by using the full power of XPath.

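var stationList = html.DocumentNode.SelectNodes("//div[@class='items']/div");
// To restrict this to the first such div in the document, the expression could be:
//     "(//div[@class='items'])[1]/div"
// Without the parentheses, [1] is applied per parent rather than to the whole result.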

foreach (var item in stationList)
{
    Station st = new Station();

    st.Name = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']/a").InnerText;

    st.Url = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']/a").Attributes["href"].Value;

    string rawText = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']/small/em").InnerText;
    st.Company = HttpUtility.HtmlDecode(rawText.Trim());

    stations.Add(st);
}

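Your mistake was using the XPath descendant axis: //div. An expression that starts with // is absolute: it searches the whole document from the root, regardless of which node you call it on. That is why item.SelectNodes("//div[...]").First() returned the same first match in the page on every iteration. A relative path such as div[...] (or .//div[...] to search at any depth) is evaluated against item itself.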
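
A minimal sketch of that difference, under some assumptions: it uses XmlDocument from System.Xml rather than HtmlAgilityPack so that it runs without extra packages, on the assumption that both evaluate these XPath 1.0 expressions the same way with respect to the context node. The markup and the names XPathContextDemo and NamesUsing are hypothetical and only loosely mirror the page in the question.

```csharp
using System;
using System.Xml;

public static class XPathContextDemo
{
    // Hypothetical markup: two station blocks inside a div with class 'items'.
    private const string Markup =
        "<root><div class='items'>" +
        "<div><div class='name'><a href='/s/1'>Station A</a></div></div>" +
        "<div><div class='name'><a href='/s/2'>Station B</a></div></div>" +
        "</div></root>";

    // Evaluates the same XPath against every station block
    // and returns the text of the link found for each block.
    public static string[] NamesUsing(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);

        XmlNodeList items = doc.SelectNodes("//div[@class='items']/div");
        var names = new string[items.Count];
        for (int i = 0; i < items.Count; i++)
        {
            // SelectSingleNode returns the first match in document order.
            names[i] = items[i].SelectSingleNode(xpath).InnerText;
        }
        return names;
    }

    public static void Main()
    {
        // Absolute: "//" starts at the document root, so the context node is ignored.
        Console.WriteLine(string.Join(", ", NamesUsing("//div[@class='name']/a")));
        // prints "Station A, Station A"

        // Relative: evaluated against each item.
        Console.WriteLine(string.Join(", ", NamesUsing("div[@class='name']/a")));
        // prints "Station A, Station B"

        // ".//" also stays inside the item while searching at any depth.
        Console.WriteLine(string.Join(", ", NamesUsing(".//div[@class='name']/a")));
        // prints "Station A, Station B"
    }
}
```

The first call reproduces the symptom from the question: every block yields the first station. The other two return one distinct name per block.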

Even better, look up the shared nodes once and reuse them:

var divName = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']");
var nodeA = divName.SelectSingleNode("a");

st.Name = nodeA.InnerText;
st.Url = nodeA.Attributes["href"].Value;

string rawText = divName.SelectSingleNode("small/em").InnerText;
st.Company = HttpUtility.HtmlDecode(rawText.Trim());
