awesomium web scraping certain parts

Question

I asked this earlier but wanted to rephrase the question. I am trying to build a scraper for my project and would like it to display one part of a link. The only part of the link that changes is the number, and that number is what I want to scrape. The link looks like this:

As mentioned, I am trying to scrape only the 966354 part of the link. I have tried several ways to do this but can't figure it out. When I add



to the code below it breaks

List<string> player = new List<string>();
string html = webControl2.ExecuteJavascriptWithResult("document.getElementsByTagName('a')[0].innerHTML");
MatchCollection m1 = Regex.Matches(html, "", RegexOptions.Singleline);
foreach (Match m in m1)
{
    // Groups[1] is the first capture group of the pattern
    string players = m.Groups[1].Value;
    player.Add(players);
}
listBox.DataSource = player;


So I removed it. Now it shows no errors at compile time, but when I run the program I get this error:

"An unhandled exception of type 'System.InvalidOperationException' occurred in Awesomium.Windows.Forms.dll"

So I tried this, and it somewhat works:

string html = webControl2.ExecuteJavascriptWithResult("document.getElementsByTagName('html')[0].innerHTML");    


This code scrapes the page, but not the way I would like. Could someone lend a helping hand, please?

Alex P. · Accepted Answer

I would use HtmlAgilityPack (install it via NuGet) and XPath queries to parse HTML.

Something like this:

string html = webControl2.ExecuteJavascriptWithResult("document.getElementsByTagName('html')[0].innerHTML");
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(html);

var playerIds = new List<string>();

var playerNodes = htmlDoc.DocumentNode.SelectNodes("//a[contains(@href, '/link/profile-view.jsp?user=')]");

if (playerNodes != null)
{
    foreach (var playerNode in playerNodes)
    {
        string href = playerNode.Attributes["href"].Value;

        var parts = href.Split(new char[] { '=' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length > 1)
        {
            playerIds.Add(parts[1]);
        }
    }

    id.DataSource = playerIds;
}
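
If adding a dependency is not an option, the same IDs could probably be pulled out with a regular expression instead. The following is only a minimal sketch, not a replacement for proper HTML parsing: it assumes every profile link contains `profile-view.jsp?user=` followed by a purely numeric ID, and the `PlayerIdExtractor` class and `ExtractUserIds` method are hypothetical names made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlayerIdExtractor
{
    // Assumption: profile links look like ".../profile-view.jsp?user=<digits>".
    private static readonly Regex UserIdPattern =
        new Regex(@"profile-view\.jsp\?user=(\d+)", RegexOptions.IgnoreCase);

    // Returns every numeric user ID found in the given HTML, in document order.
    public static List<string> ExtractUserIds(string html)
    {
        var ids = new List<string>();
        foreach (Match match in UserIdPattern.Matches(html))
        {
            // Groups[1] holds the digits captured by (\d+).
            ids.Add(match.Groups[1].Value);
        }
        return ids;
    }
}
```

With the `html` string from above, this could be used as `id.DataSource = PlayerIdExtractor.ExtractUserIds(html);`. Unlike the XPath version, it matches the pattern anywhere in the markup, not only inside `href` attributes, so it may pick up duplicates or IDs from scripts and comments.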

Also you may find these two simple helper classes useful: https://gist.github.com/AlexP11223/8286153

The first one contains extension methods for WebView/WebControl, and the second one has static methods that generate JS code for retrieving elements (JSObject) by XPath and for getting the coordinates of a JSObject.
