Reputation: 61
I asked this earlier, but I want to rephrase the question. I am building a scraper for my project and would like it to extract one part of a link. The only part of the link that changes is the number, and that number is what I want to scrape. The link looks like this:
<a href="/link/player.jsp?user=966354" target="_parent" "="">
As mentioned, I am trying to scrape only the 966354 part of the link. I have tried several approaches but can't figure it out. When I add
<a href="/link/player.jsp?user="
to the code below it breaks
List<string> player = new List<string>();
string html = webControl2.ExecuteJavascriptWithResult("document.getElementsByTagName('a')[0].innerHTML");
MatchCollection m1 = Regex.Matches(html, "<a href=\\s*(.+?)\\s*</a>", RegexOptions.Singleline);
foreach (Match m in m1)
{
    string players = m.Groups[1].Value;
    player.Add(players);
}
listBox.DataSource = player;
So I removed it. No errors are shown until I run the program, and then I get this exception:
"An unhandled exception of type 'System.InvalidOperationException' occurred in Awesomium.Windows.Forms.dll"
So I tried this, and it somewhat works:
string html = webControl2.ExecuteJavascriptWithResult("document.getElementsByTagName('html')[0].innerHTML");
This code scrapes the page, but not the way I would like. Could someone lend a helping hand, please?
Upvotes: 0
Views: 786
Reputation: 3787
I would use HtmlAgilityPack (install it via NuGet) and XPath queries to parse HTML.
Something like this:
// Grab the whole page markup from the Awesomium control.
string html = webControl2.ExecuteJavascriptWithResult("document.getElementsByTagName('html')[0].innerHTML");
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(html);
var playerIds = new List<string>();
// Select every link whose href contains the player URL from the question.
var playerNodes = htmlDoc.DocumentNode.SelectNodes("//a[contains(@href, '/link/player.jsp?user=')]");
if (playerNodes != null) // SelectNodes returns null when nothing matches
{
    foreach (var playerNode in playerNodes)
    {
        string href = playerNode.Attributes["href"].Value;
        // Everything after the first '=' is the user ID.
        var parts = href.Split(new char[] { '=' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length > 1)
        {
            playerIds.Add(parts[1]);
        }
    }
    listBox.DataSource = playerIds;
}
Also you may find these two simple helper classes useful: https://gist.github.com/AlexP11223/8286153
The first one contains extension methods for WebView/WebControl, and the second one has some static methods that generate JS code for retrieving elements (JSObject) by XPath and for getting the coordinates of a JSObject.
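The Split on '=' above works as long as user is the only query parameter. As a minimal sketch, assuming an href might carry extra parameters or a fragment, the value can be picked out by name using only the base class library. HrefParser and GetQueryValue are hypothetical names used for illustration; they are not part of HtmlAgilityPack or Awesomium.

```csharp
using System;

public static class HrefParser
{
    // Hypothetical helper: returns the value of the named query
    // parameter in href, or null if the parameter is not present.
    public static string GetQueryValue(string href, string name)
    {
        int q = href.IndexOf('?');
        if (q < 0) return null;

        // Keep only the query string, dropping any "#fragment".
        string query = href.Substring(q + 1);
        int hash = query.IndexOf('#');
        if (hash >= 0) query = query.Substring(0, hash);

        foreach (string pair in query.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int eq = pair.IndexOf('=');
            if (eq <= 0) continue; // skip pairs with no name or no '='
            if (pair.Substring(0, eq) == name)
                return Uri.UnescapeDataString(pair.Substring(eq + 1));
        }
        return null;
    }
}
```

Inside the loop, playerIds.Add(parts[1]) could then be swapped for a call to HrefParser.GetQueryValue(href, "user"), with a null check before adding.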
Upvotes: 1
Reputation: 91
Using a sample HTML file such as the one below, I was unable to reproduce the exception.
<html>
<a href="/link/player.jsp?user=966354" target="_parent" "="">test</a>
</html>
However, the JavaScript
document.getElementsByTagName('a')[0].innerHTML
will return "test" in my example. What you probably want is
document.getElementsByTagName('a')[0].href
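which will return the link's URL. Note that 'href' gives the fully resolved absolute URL; getAttribute('href') returns the attribute exactly as written in the page.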
The 'innerHTML' property will return everything between the start and end tags (such as <html> </html>). This is probably the reason you have better success when getting the 'html' element - you end up parsing the entire <a> </a> link.
FYI, you can run the JavaScript in your browser's console to check its output.
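Since 'innerHTML' of the 'html' element hands back the whole page markup, a regex-only way to pull the IDs out of that string might look like the sketch below. It assumes every player link is written exactly as in the question (double-quoted href, numeric user value). PlayerLinks and ExtractUserIds are hypothetical names used for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlayerLinks
{
    // Matches href="/link/player.jsp?user=<digits>" and captures the digits.
    // Assumption: the attribute is double-quoted, as in the question.
    private static readonly Regex UserIdPattern =
        new Regex(@"href=""/link/player\.jsp\?user=(\d+)""", RegexOptions.IgnoreCase);

    // Returns the user IDs of all matching links, in document order.
    public static List<string> ExtractUserIds(string html)
    {
        var ids = new List<string>();
        foreach (Match m in UserIdPattern.Matches(html))
        {
            ids.Add(m.Groups[1].Value);
        }
        return ids;
    }
}
```

With the html string from the question, the result could be bound directly, for example listBox.DataSource = PlayerLinks.ExtractUserIds(html).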
Upvotes: 0