Reputation: 6547
I want to extract some information from a website, and I use HtmlAgilityPack and LINQ to query the HTML.
In this particular example I want to get the value of the m_name parameter from the href attribute of the A tag, and then the value of the src attribute of the IMG tag.
<A href="/index.php?lang=eng&ssid=&wbid=&refid=website.com&mref=&showall=0&Submit=m_info&refname=&id=37447&m_name=LacosteShoe">
<DIV name="prdiv1" id="prdiv1" overflow:hidden;">
<IMG name="pic1" id="pic1" class=pic_2 alt="for sale here for 2 days" title="for sale here for 2 days" src="item/preview/37447_pr2.jpg?55995" >
</DIV>
</A>
I would like to make a list of string pairs from these values (something like a List<string,string>), so that for this example the result would be equivalent to:
list.Add("LacosteShoe", "item/preview/37447_pr2.jpg?55995");
Is it possible to do this in a single LINQ query? It is far too advanced for my beginner's knowledge. I would also have to make sure that it doesn't fail if, for example, the href attribute doesn't exist.
I basically got this so far:
var query = document.DocumentNode.Descendants("a")
    .Where(a => a.Attributes["href"].Value.Contains("m_name="))
    .Select();
Upvotes: 0
Views: 157
Reputation: 5495
var query = document.DocumentNode.Descendants("a")
    .Where(a => a.Attributes["href"].Value.Contains("m_name="))
    .Select(b => new { Name = ExtractName(b.Attributes["href"].Value),
                       Link = b.Descendants("div").First()
                               .Descendants("img").First().Attributes["src"].Value })
    .ToList();
Define the function ExtractName(string str) to extract the name from the href value. You can use Regex for this.
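One possible shape for ExtractName, as a minimal sketch rather than a definitive implementation. The HrefHelper class name and the choice to return null when m_name is missing are assumptions made for illustration. It also assumes the parameters are separated by a plain & (not &amp;), as in the sample markup.

```csharp
using System;
using System.Text.RegularExpressions;

static class HrefHelper
{
    // Hypothetical helper: returns the value of the m_name query parameter,
    // or null when the href is empty or does not contain that parameter.
    public static string ExtractName(string href)
    {
        if (string.IsNullOrEmpty(href))
            return null;

        // "m_name=" must follow '?' or '&'; capture everything up to the
        // next '&' or '#', or to the end of the string.
        Match m = Regex.Match(href, @"[?&]m_name=([^&#]*)");

        // Undo percent-encoding such as %20 in the captured value.
        return m.Success ? Uri.UnescapeDataString(m.Groups[1].Value) : null;
    }
}
```

Because the pattern stops at the next &, this should keep working if m_name is not the last parameter in the URL.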
Upvotes: 2
Reputation: 3281
Try
List<string> products = document.DocumentNode.Descendants("a")
    .Where(a => a.Attributes["href"] != null
        && a.Attributes["href"].Value.Contains("m_name="))
    .Select(l => l.Attributes["href"].Value.Substring(
        l.Attributes["href"].Value.IndexOf("m_name=") + 7))
    .ToList();
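If you also need the image path, the same idea can be extended to return pairs. The following is only a sketch under some assumptions: the HtmlAgilityPack package is referenced, ProductScraper and GetProducts are made-up names, m_name is the last parameter in the href (as in the sample), and links without an IMG descendant are skipped rather than reported.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ProductScraper
{
    // Hypothetical helper: returns (name, src) pairs for every <a> whose
    // href contains "m_name=" and which has an <img> somewhere inside it.
    public static List<KeyValuePair<string, string>> GetProducts(HtmlDocument document)
    {
        const string key = "m_name=";

        return document.DocumentNode.Descendants("a")
            .Select(a => new
            {
                // GetAttributeValue returns the default ("") instead of
                // throwing when the attribute is missing.
                Href = a.GetAttributeValue("href", ""),
                Img = a.Descendants("img").FirstOrDefault()
            })
            .Where(x => x.Href.Contains(key) && x.Img != null)
            .Select(x => new KeyValuePair<string, string>(
                // Everything after "m_name=" (assumes it is the last parameter).
                x.Href.Substring(x.Href.IndexOf(key) + key.Length),
                x.Img.GetAttributeValue("src", "")))
            .ToList();
    }
}
```

A List<KeyValuePair<string, string>> is used here because List<string,string> does not exist; a Dictionary<string, string> would also work if the names are unique.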
Upvotes: 1