Phil

Reputation: 723

Parsing Nodes with HTML AgilityPack

I'm trying to get information from this page: http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets

The rows look like this when inspecting elements (screenshot: InspectElement).

I've tried this code, but the node selection returns null every time, whichever nodes I query:

    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Linq;
    using System.Threading.Tasks;
    using System.Windows.Forms;
    using HtmlAgilityPack;

    public class ItemSetsTransmog
    {
        public string ItemSetName { get; set; }
        public string ItemSetId { get; set; }
    }

    public partial class Fmain : Form
    {
        DataTable Table;
        HtmlWeb web = new HtmlWeb();

        public Fmain()
        {
            InitializeComponent();
            initializeItemSetTransmogTable();
        }

        private async void Fmain_Load(object sender, EventArgs e)
        {
            int PageNum = 0;
            var itemsets = await ItemSetTransmogFromPage(PageNum);
            while (itemsets.Count > 0)
            {
                foreach (var itemset in itemsets)
                    Table.Rows.Add(itemset.ItemSetName, itemset.ItemSetId);

                // Pre-increment so the next request is for the next page, not page 0 again
                itemsets = await ItemSetTransmogFromPage(++PageNum);
            }
        }

        private async Task<List<ItemSetsTransmog>> ItemSetTransmogFromPage(int PageNum)
        {
            String url = "http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets";
            if (PageNum != 0)
                url = "http://www.wowhead.com/transmog-sets?filter=3;5;0#transmog-sets:75+" + PageNum.ToString();

            // HtmlWeb.Load is synchronous, so run it off the UI thread
            var doc = await Task.Factory.StartNew(() => web.Load(url));
            var NameNodes = doc.DocumentNode.SelectNodes("//*[@id=\"tab - transmog - sets\"]//div//table//tr//td//div//a");
            var IdNodes = doc.DocumentNode.SelectNodes("//*[@id=\"tab - transmog - sets\"]//div//table//tr//td//div//a");

            // if these are null it means the name/id nodes couldn't be found on the html page
            if (NameNodes == null || IdNodes == null)
                return new List<ItemSetsTransmog>();

            var ItemSetNames = NameNodes.Select(node => node.InnerText);
            var ItemSetIds = IdNodes.Select(node => node.InnerText);

            return ItemSetNames.Zip(ItemSetIds, (name, id) => new ItemSetsTransmog() { ItemSetName = name, ItemSetId = id }).ToList();
        }

        private void initializeItemSetTransmogTable()
        {
            Table = new DataTable("ItemSetTransmogTable");
            Table.Columns.Add("ItemSetName", typeof(string));
            Table.Columns.Add("ItemSetId", typeof(string));

            ItemSetTransmogDataView.DataSource = Table;
        }
    }

Why doesn't my code load any of these nodes? How can I fix it?

Upvotes: 0

Views: 125

Answers (1)

SpruceMoose

Reputation: 10320

Your code does not load these nodes because they do not exist in the HTML that HTML Agility Pack pulls back. This is probably because most of the markup you have shown is generated by JavaScript after the page loads. To confirm, inspect the doc.ParsedText property in your ItemSetTransmogFromPage() method and check whether the table rows are present in it.
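
As a rough illustration of that check, the sketch below parses a small hand-written HTML string and tests whether an XPath matches anything. The class name RawMarkupCheck, its methods, and the sample markup in the usage comment are all made up for this example and are not taken from the real page. It relies on SelectNodes returning null when nothing matches, which is HTML Agility Pack's default behaviour.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper for illustration only; not part of HTML Agility Pack.
public static class RawMarkupCheck
{
    // Parses an HTML string into an HtmlDocument, much as HtmlWeb.Load
    // does with the body of the HTTP response.
    public static HtmlDocument Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    // True when the XPath matches at least one node.
    // By default SelectNodes returns null (not an empty collection) on no match.
    public static bool HasNodes(HtmlDocument doc, string xpath)
    {
        return doc.DocumentNode.SelectNodes(xpath) != null;
    }

    // True when the raw text that was parsed contains the given marker,
    // e.g. a table id or a script tag.
    public static bool RawTextContains(HtmlDocument doc, string marker)
    {
        return doc.ParsedText != null
            && doc.ParsedText.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}

// Usage sketch (sample markup is an assumption: an empty container plus a script):
//   var doc = RawMarkupCheck.Parse(
//       "<div id=\"tab-transmog-sets\"></div><script>/* rows built at runtime */</script>");
//   RawMarkupCheck.HasNodes(doc, "//*[@id='tab-transmog-sets']//a");   // no anchors in raw HTML
//   RawMarkupCheck.RawTextContains(doc, "<script");                    // script tag is present
```

If the container exists in the raw text but the anchors do not, the rows are most likely being added by a script after the page loads.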

HTML Agility Pack is an HTTP client and parser; it does not run scripts. If you really need to get the data this way, you will need a "headless browser" such as Optimus to retrieve the rendered page (caveat: I have not used this library, though a NuGet package appears to exist), and then you can probably use HTML Agility Pack to parse and query the resulting markup.
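
Since I cannot vouch for the Optimus API, here is a minimal sketch of the same idea using Selenium WebDriver with headless Chrome instead; that is a swapped-in tool, not the library named above. It assumes the Selenium.WebDriver NuGet package and a matching Chrome/chromedriver install. The class and method names are hypothetical, and a real page may need an explicit wait before the script-generated rows appear in PageSource.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using OpenQA.Selenium.Chrome;

// Hypothetical helper for illustration only.
public static class RenderedPageScraper
{
    // Loads the URL in headless Chrome so that page scripts run,
    // then returns the DOM serialized at that moment.
    public static string GetRenderedHtml(string url)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");
        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);
            return driver.PageSource;
        }
    }

    // Parses the HTML with HTML Agility Pack and returns the trimmed text
    // of every node matched by the XPath (empty list when nothing matches).
    public static List<string> ExtractLinkTexts(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null
            ? new List<string>()
            : nodes.Select(n => n.InnerText.Trim()).ToList();
    }
}

// Usage sketch:
//   var html = RenderedPageScraper.GetRenderedHtml("http://www.wowhead.com/transmog-sets?filter=3;5;0");
//   var names = RenderedPageScraper.ExtractLinkTexts(html, "//*[@id='tab-transmog-sets']//table//tr//td//div//a");
```

Keeping rendering and parsing in separate methods means the parsing half can be tried against a saved HTML string without starting a browser.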

Another option might be to parse the JSON that is embedded in this page, if it provides the data you need (although this appears unlikely).
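
A sketch of what that could look like, under loud assumptions: the variable name transmogSets, the "id" and "name" fields, and the idea that the data is a strict JSON array assigned in an inline script are all invented for this example. The real page may embed a JavaScript object literal that is not valid JSON, in which case JsonDocument.Parse would throw and a different approach would be needed.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class EmbeddedJsonReader
{
    // Looks for "var transmogSets = [ ... ];" (an assumed pattern) in the HTML
    // and parses the array as JSON. Returns an empty list if the pattern is absent.
    // The lazy match stops at the first "];", so nested arrays are not handled.
    public static List<(int Id, string Name)> ReadSets(string html)
    {
        var result = new List<(int Id, string Name)>();
        var match = Regex.Match(
            html,
            @"var\s+transmogSets\s*=\s*(\[.*?\]);",
            RegexOptions.Singleline);
        if (!match.Success)
            return result;

        using (var json = JsonDocument.Parse(match.Groups[1].Value))
        {
            foreach (var item in json.RootElement.EnumerateArray())
            {
                // "id" and "name" are assumed field names
                result.Add((item.GetProperty("id").GetInt32(),
                            item.GetProperty("name").GetString()));
            }
        }
        return result;
    }
}

// Usage sketch:
//   var sets = EmbeddedJsonReader.ReadSets(doc.ParsedText);
```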

Small note: I think the id in your XPath should be "tab-transmog-sets" instead of "tab - transmog - sets".
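
To show the difference the spaces make, here is a small sketch that runs both XPath expressions against a hand-written HTML string. The sample markup, the href format, and the XPathIdCheck class are assumptions for illustration. It also reads the href attribute, since taking InnerText for both the name and the id yields the same value twice.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration only.
public static class XPathIdCheck
{
    // Made-up markup that mimics the structure the XPath expects.
    public const string SampleHtml =
        "<div id=\"tab-transmog-sets\"><div><table><tr><td><div>" +
        "<a href=\"/transmog-set=1\">Example Set</a>" +
        "</div></td></tr></table></div></div>";

    // The XPath from the question (id contains spaces) and the corrected one.
    public const string SpacedXPath =
        "//*[@id=\"tab - transmog - sets\"]//div//table//tr//td//div//a";
    public const string CorrectedXPath =
        "//*[@id=\"tab-transmog-sets\"]//div//table//tr//td//div//a";

    // Returns the link text and href of every matched anchor
    // (empty list when SelectNodes returns null).
    public static List<(string Name, string Href)> SelectLinks(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
            return new List<(string Name, string Href)>();

        return nodes
            .Select(n => (Name: n.InnerText.Trim(), Href: n.GetAttributeValue("href", "")))
            .ToList();
    }
}

// Usage sketch:
//   XPathIdCheck.SelectLinks(XPathIdCheck.SampleHtml, XPathIdCheck.SpacedXPath);     // no matches
//   XPathIdCheck.SelectLinks(XPathIdCheck.SampleHtml, XPathIdCheck.CorrectedXPath);  // one anchor
```

Fixing the id alone will not help if the rows are missing from the downloaded HTML, but it is needed once the rendered markup is available.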

Upvotes: 1
