ItsLockedOut

Reputation: 239

Ways to extract links and posts from a web page using Html Agility Pack in C#?

I am designing a website that scrapes top technology websites such as thenextweb.com, mashable.com and readwriteweb.com.

One way to scrape with the Html Agility Pack is to take one website, say thenextweb.com, and fetch its article links and content according to its <tags>, i.e. select <div class ="article-listing"> ..... </div> and fetch the links inside it. An algorithm would then have to be designed the same way for each website, since the tags differ from site to site.

Here's what I used for getting links from the website thenextweb.com's home page:

var webGet = new HtmlWeb();
var document = webGet.Load(url);
var infos = from info in document.DocumentNode.SelectNodes("//div[@class='article-listing']")
            select new
            {
                Contr = info.InnerHtml
            };

lvLinks.DataSource = infos; 
lvLinks.DataBind();

Is there a simpler way to extract the links and content (the post, its images, date, etc.)?

Upvotes: 1

Views: 983

Answers (2)

ItsLockedOut

Reputation: 239

I found a way to extract the links by using more "from" clauses in the LINQ query. I can use:

var infos = from info in document.DocumentNode.SelectNodes("//div[@class='article-listing']")
            // Second "from": every <a> under an <h4> that carries an href attribute.
            from link in info.SelectNodes("h4//a").Where(x => x.Attributes.Contains("href"))
            select new
            {
                LinkURL = link.Attributes["href"].Value
            };

Links and images can both be fetched this way.
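As a rough sketch of the same pattern applied to images: the class and method names below (ListingScraper, ExtractImageUrls) are made up for illustration, and the "article-listing" class is simply the one from the question, so it may not match the site's current markup. One thing worth guarding against is that SelectNodes returns null rather than an empty collection when nothing matches.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ListingScraper
{
    // Hypothetical helper: returns the src of every <img> found inside
    // the <div class="article-listing"> blocks of an already loaded page.
    public static List<string> ExtractImageUrls(HtmlDocument document)
    {
        // SelectNodes returns null (not an empty collection) on no match.
        HtmlNodeCollection listings =
            document.DocumentNode.SelectNodes("//div[@class='article-listing']");
        if (listings == null)
            return new List<string>();

        return listings
            .SelectMany(div => ImagesIn(div))
            .Select(img => img.GetAttributeValue("src", ""))
            .ToList();
    }

    // Wraps the null check for the inner query so SelectMany never sees null.
    private static IEnumerable<HtmlNode> ImagesIn(HtmlNode div)
    {
        HtmlNodeCollection images = div.SelectNodes(".//img[@src]");
        return images ?? Enumerable.Empty<HtmlNode>();
    }
}
```

With the document from the question, this would be called as ListingScraper.ExtractImageUrls(document). The returned values are whatever the page puts in src, so relative paths would still need to be resolved against the page URL.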

Thanks, no issue now.

Upvotes: 1

Kobi

Reputation: 138037

All of these sites should have RSS feeds, which are the best way to get the data. For example, The Next Web has these tags (you don't really need the tags, just the URL):

<link rel="alternate" type="application/rss+xml" title="TNW Network All Stories RSS Feed" href="http://feeds2.feedburner.com/thenextweb" />
<link rel="alternate" type="application/rss+xml" title="TNW Network Top Stories RSS Feed" href="http://feeds2.feedburner.com/thenextwebtopstories" />   

http://feeds2.feedburner.com/thenextwebtopstories

The feeds should all be in the same format (or at least a similar one), which is much easier to work with than raw HTML and isn't likely to change. You shouldn't have any trouble finding a .NET RSS parser.
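One possible parser, as a minimal sketch, is the SyndicationFeed class in System.ServiceModel.Syndication, which reads both RSS 2.0 and Atom 1.0. The FeedReader class and ReadItems method are names invented for this example, and depending on the .NET version the namespace comes from the System.ServiceModel assembly or from a separate NuGet package.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.ServiceModel.Syndication;
using System.Xml;

public static class FeedReader
{
    // Hypothetical helper: returns (title, first link, publish date) for each feed item.
    public static List<Tuple<string, string, DateTimeOffset>> ReadItems(XmlReader reader)
    {
        // Load detects whether the XML is RSS 2.0 or Atom 1.0.
        SyndicationFeed feed = SyndicationFeed.Load(reader);

        return feed.Items
            .Select(item => Tuple.Create(
                item.Title != null ? item.Title.Text : "",
                item.Links.Count > 0 ? item.Links[0].Uri.ToString() : "",
                item.PublishDate))
            .ToList();
    }
}
```

Assuming the feed URL above still resolves, the reader could be created with XmlReader.Create("http://feeds2.feedburner.com/thenextwebtopstories") and passed to FeedReader.ReadItems inside a using block. The same call should work unchanged for the other sites' feeds, which is the main advantage over writing one XPath query per site.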

Upvotes: 1
