Reputation: 8857
I have the following HTML
(..)
<tbody>
<tr>
<td class="name"> Test1 </td>
<td class="data"> Data </td>
<td class="data2"> Data 2 </td>
</tr>
<tr>
<td class="name"> Test2 </td>
<td class="data"> Data2 </td>
<td class="data2"> Data 2 </td>
</tr>
</tbody>
(..)
The information I have is the name => so "Test1" & "Test2". What I want to know is how can I get the data that's in "data" and "data2" based on the Name I have.
Currently I'm using:
var data =
from
tr in doc.DocumentNode.Descendants("tr")
from
td in tr.ChildNodes.Where(x => x.Attributes["class"].Value == "name")
where
td.InnerText == "Test1"
select tr;
But I get {"Object reference not set to an instance of an object."}
when I try to look in data
Upvotes: 12
Views: 22953
Reputation: 57
I can recommend one of two ways:
http://htmlagilitypack.codeplex.com/, which converts the html to valid xml which can then be queried against with OOTB Linq.
Or,
Linq to HTML (http://www.superstarcoders.com/linq-to-html.aspx), which while not maintained on CodePlex ( that was a hint, Keith ), gives a reasonable working set of features to springboard from.
Upvotes: -1
Reputation: 138841
Here is the XPATH way - hmmm... everyone seems to have forgotten about the power XPATH and concentrate exclusively on C# XLinq, these days :-)
This function gets all data values associated with a name:
public static IEnumerable<string> GetData(HtmlDocument document, string name)
{
return from HtmlNode node in
document.DocumentNode.SelectNodes("//td[@class='name' and contains(text(), '" + name + "')]/following-sibling::td")
select node.InnerText.Trim();
}
For example, this code will dump all 'Test2' data:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtml);
foreach (string data in GetData(doc, "Test2"))
{
Console.WriteLine(data);
}
Upvotes: 5
Reputation: 14321
instead of
td.InnerText == "Test1"
try
td.InnerText == " Test1 "
or
d.InnerText.Trim() == "Test1"
Upvotes: -1
Reputation: 138007
As for your attempt, you have two issues with your code:
ChildNodes
is weird - it also returns whitespace text nodes, which don't have a class
attributes (can't have attributes, of course).With these two corrections, the following works:
var data =
from tr in doc.DocumentNode.Descendants("tr")
from td in tr.Descendants("td").Where(x => x.Attributes["class"].Value == "name")
where td.InnerText.Trim() == "Test1"
select tr;
Upvotes: 16
Reputation: 138007
Here's one approach - first parse all data into a data structure, and then read it. This is a little messy and certainly needs more validation, but here goes:
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://jsbin.com/ezuge4");
HtmlNodeCollection nodes = doc.DocumentNode
.SelectNodes("//table[@id='MyTable']//tr");
var data = nodes.Select(
node => node.Descendants("td")
.ToDictionary(descendant => descendant.Attributes["class"].Value,
descendant => descendant.InnerText.Trim())
).ToDictionary(dict => dict["name"]);
string test1Data = data["Test1"]["data"];
Here I turn every <tr>
to a dictionary, where the class of the <td>
is a key and the text is a value. Next, I turn the list of dictionaries into a dictionary of dictionaries (tip - abstract that away), where the name
of every <tr>
is the key.
Upvotes: 1