Reputation: 239
I am designing a website that scrapes top technology websites such as thenextweb.com, mashable.com and readwriteweb.com.
One way to scrape with the Html Agility Pack is to take one website, say thenextweb.com, and fetch its article links and content according to its tags,
i.e. using <div class="article-listing"> ..... </div>
and fetching the links from that.
In the same manner, I would have to design an algorithm for each and every website (as the tags are different for each website).
Here's what I used for getting links from the website thenextweb.com's home page:
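// Requires: using System.Linq; using HtmlAgilityPack;
var webGet = new HtmlWeb();
var document = webGet.Load(url);
// Note: SelectNodes returns null when no node matches the XPath.
var infos = from info in document.DocumentNode.SelectNodes("//div[@class='article-listing']")
            select new
            {
                Contr = info.InnerHtml
            };
// Bind the projected results to the list view.
lvLinks.DataSource = infos;
lvLinks.DataBind();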
Is there a simpler way to extract the links and content (the post, its images, date, etc.)?
Upvotes: 1
Views: 983
Reputation: 239
I found a way to extract the links by adding more "from" clauses to the LINQ query.
I can use:
var infos = from info in document.DocumentNode.SelectNodes("//div[@class='article-listing']")
            from link in info.SelectNodes("h4//a").Where(x => x.Attributes.Contains("href"))
            select new
            {
                LinkURL = link.Attributes["href"].Value
            };
Links and images can be fetched in this way.
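The query above throws if either SelectNodes call finds nothing, because Html Agility Pack returns null rather than an empty collection in that case. The following is a minimal null-safe sketch of the same idea, not a definitive implementation. The LinkExtractor class, the ExtractLinks method and the sample HTML are made up for illustration, and the inner XPath ".//h4//a[@href]" is slightly looser than the "h4//a" used above.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns the href of every link found under an h4
    // inside each node matched by listingXPath.
    public static List<string> ExtractLinks(string html, string listingXPath)
    {
        var links = new List<string>();
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var listings = document.DocumentNode.SelectNodes(listingXPath);
        if (listings == null) return links;

        foreach (var listing in listings)
        {
            var anchors = listing.SelectNodes(".//h4//a[@href]");
            if (anchors == null) continue;

            foreach (var anchor in anchors)
            {
                // GetAttributeValue falls back to the default if href is missing.
                var href = anchor.GetAttributeValue("href", "");
                if (href.Length > 0) links.Add(href);
            }
        }
        return links;
    }

    public static void Main()
    {
        // Made-up markup standing in for a downloaded page.
        var html = "<div class='article-listing'>" +
                   "<h4><a href='http://example.com/a'>A</a></h4>" +
                   "<h4><a>no href</a></h4></div>";
        foreach (var link in ExtractLinks(html, "//div[@class='article-listing']"))
            Console.WriteLine(link);
    }
}
```

Passing the XPath as a parameter means each site only needs its own listing expression, while the extraction loop stays the same.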
Thanks, this is no longer an issue.
Upvotes: 1
Reputation: 138037
All of these sites should have RSS feeds, which are the best way to get the data. For example, The Next Web has these tags (you don't really need the tags, just the URL):
<link rel="alternate" type="application/rss+xml" title="TNW Network All Stories RSS Feed" href="http://feeds2.feedburner.com/thenextweb" />
<link rel="alternate" type="application/rss+xml" title="TNW Network Top Stories RSS Feed" href="http://feeds2.feedburner.com/thenextwebtopstories" />
http://feeds2.feedburner.com/thenextwebtopstories
The feeds should be in the same format (or at least a similar format), which is much easier to understand than raw HTML and isn't likely to change. You shouldn't have any trouble finding a .NET RSS parser.
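As one possible parser, here is a minimal sketch that uses SyndicationFeed from System.ServiceModel.Syndication (part of the .NET Framework, and available as a NuGet package on newer runtimes). The FeedReader class, the ReadItems method and the inline sample feed are made up for illustration. For a live feed, the reader could be created with XmlReader.Create(feedUrl) instead.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.ServiceModel.Syndication;
using System.Xml;

public static class FeedReader
{
    // Hypothetical helper: returns "title | link" for each item in the feed.
    public static List<string> ReadItems(XmlReader reader)
    {
        var result = new List<string>();
        SyndicationFeed feed = SyndicationFeed.Load(reader);
        foreach (SyndicationItem item in feed.Items)
        {
            var title = item.Title != null ? item.Title.Text : "(no title)";
            var link = item.Links.Count > 0 ? item.Links[0].Uri.ToString() : "(no link)";
            result.Add(title + " | " + link);
        }
        return result;
    }

    public static void Main()
    {
        // Made-up RSS 2.0 document standing in for a downloaded feed.
        var xml = "<rss version=\"2.0\"><channel><title>Sample</title>" +
                  "<link>http://example.com/</link><description>Sample feed</description>" +
                  "<item><title>First post</title><link>http://example.com/first</link>" +
                  "<pubDate>Mon, 06 Sep 2010 16:45:00 GMT</pubDate></item>" +
                  "</channel></rss>";
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            foreach (var line in ReadItems(reader))
                Console.WriteLine(line);
        }
    }
}
```

Each SyndicationItem also exposes PublishDate and Summary, so the post date and an excerpt come from the same object without any site-specific XPath.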
Upvotes: 1