Daniel van wolf

Reputation: 403

How can I extract links from a string of HTML content using HtmlAgilityPack?

for (int i = 0; i < numberoflinks; i++)
{
    string downloadString = client.DownloadString(mainlink + i + ".html");
    var document = new HtmlWeb().Load(url);
    var urls = document.DocumentNode.Descendants("img")
                       .Select(e => e.GetAttributeValue("src", null))
                       .Where(s => !String.IsNullOrEmpty(s));
}

The problem is that HtmlWeb().Load requires a URL, but I want to load the string downloadString, which already contains the HTML content.

Update:

I tried this now:

for (int i = 0; i < numberoflinks; i++)
{
    string downloadString = client.DownloadString(mainlink + i + ".html");
    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    document.Load(downloadString);
    var urls = document.DocumentNode.Descendants("img")
                       .Select(e => e.GetAttributeValue("src", null))
                       .Where(s => !String.IsNullOrEmpty(s));
}

But I'm getting an exception on this line:

document.Load(downloadString);

Illegal characters in path

What I'm trying to do is download all the .JPG images from each link without first saving the page to the hard disk: download the page content to a string, extract every image link ending in .JPG from that HTML, and then download the JPGs.

Upvotes: 1

Views: 306

Answers (1)

David Tansey

Reputation: 6013

You should be able to process a string of HTML using the LoadHtml() method of HtmlDocument.

From the source code:

public void LoadHtml(string html)

Summary: Loads the HTML document from the specified string.

Parameter html: String containing the HTML document to load. May not be null.

The Load method expects a file name, which is the reason for the message about illegal characters in the path: the whole HTML string was being treated as a path.
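As a rough sketch of how that could look for the .JPG case in the question: the helper below parses the already-downloaded string with LoadHtml and keeps only the img sources ending in .jpg. The class name ImageLinkExtractor, the method ExtractJpgUrls, and the pageUri parameter are hypothetical names introduced here for illustration, and resolving relative src values against the page address is an assumption about what the pages contain.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack or the question's code.
public static class ImageLinkExtractor
{
    // Returns the absolute URL of every <img> whose src ends in ".jpg"
    // (case-insensitive). pageUri is the address the HTML was downloaded
    // from; it is only used to resolve relative src values.
    public static List<Uri> ExtractJpgUrls(string html, Uri pageUri)
    {
        var document = new HtmlDocument();

        // LoadHtml parses the string itself; Load(string) would treat it as a file path.
        document.LoadHtml(html);

        return document.DocumentNode.Descendants("img")
            .Select(e => e.GetAttributeValue("src", ""))
            .Where(s => !string.IsNullOrEmpty(s))
            .Where(s => s.EndsWith(".jpg", StringComparison.OrdinalIgnoreCase))
            .Select(s => new Uri(pageUri, s))
            .ToList();
    }
}
```

Inside the loop from the question, this would be called with downloadString and new Uri(mainlink + i + ".html"), and each returned Uri could then be passed to client.DownloadFile together with a local file name. Note that a src with a query string after the extension (for example photo.jpg?size=large) would not match this simple EndsWith filter.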

Upvotes: 2
