Alejandra Llovera

Reputation: 29

How do I get only the files from an entire HTML page read in a C# console app?

I need to get every file from a URL so that I can then iterate over them.

The idea is to resize each image using ImageMagick, but first I need to be able to get the files and iterate over them.

Here is the code I have written so far:

using System;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;

namespace Example
{
    public class MyExample
    {

        public static void Main(String[] args)
        {
            string url = "https://www.paz.cl/imagenes_cotizador/BannerPrincipal/";
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                {
                    string html = reader.ReadToEnd();
                    Console.WriteLine(html);

                }
            }

            Console.ReadLine();
        }
    }
}

This returns the entire HTML of the URL. However, I just need the files (all the images) so I can work with them as I expect.

Any idea how to achieve this?

Upvotes: 0

Views: 65

Answers (3)

Arash Motamedi

Reputation: 10682

I looked at that page, and it's a directory/file list. You can use Regex to extract all links to images from the body of that page.

Here's a pattern I could think of: HREF="([^"]+\.(jpg|png))

Build your regex object, iterate over the matches, and download each image:

var regex = new System.Text.RegularExpressions.Regex("HREF=\"([^\"]+\\.(jpg|png))");
var matches = regex.Matches(html); // html is the string you already downloaded
foreach (System.Text.RegularExpressions.Match match in matches)
{
    // Group 1 holds the path captured after HREF="
    var imagePath = match.Groups[1].Value;
    Console.WriteLine(imagePath);
}

Now concatenate the base URL https://www.paz.cl with the relative image path obtained above, issue another request to that URL to download the image, and process it as you wish.
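
As a minimal sketch of that last step, assuming the Magick.NET NuGet package (the .NET binding for ImageMagick) and a hypothetical DownloadAndResize helper with a "resized" output folder, downloading and resizing one image could look like this:

// Sketch only: assumes the Magick.NET NuGet package is installed
// and requires using System.IO; using System.Net; using ImageMagick;
static void DownloadAndResize(string imagePath)
{
    // imagePath is one relative path produced by the regex loop above
    var imageUrl = "https://www.paz.cl" + imagePath;
    using (var client = new WebClient())
    {
        byte[] bytes = client.DownloadData(imageUrl);
        using (var image = new MagickImage(bytes))
        {
            image.Resize(800, 600); // fits the image within 800x600, keeping the aspect ratio
            Directory.CreateDirectory("resized");
            image.Write(Path.Combine("resized", Path.GetFileName(imagePath)));
        }
    }
}

Call it once per imagePath inside the loop above.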

Upvotes: 1

meziantou

Reputation: 21337

You can use AngleSharp to load and parse the HTML page. Then you can extract all the information you need.

// TODO add a reference to the AngleSharp NuGet package
// Requires using System; using System.Linq; using System.Threading.Tasks; using AngleSharp;
private static async Task Main(string[] args)
{
    var config = Configuration.Default.WithDefaultLoader();
    var address = "https://www.paz.cl/imagenes_cotizador/BannerPrincipal";
    var context = BrowsingContext.New(config);
    var document = await context.OpenAsync(address);

    // Source gives the absolute URL of each <img> element on the page
    var images = document.Images.Select(img => img.Source);
    foreach (var image in images)
    {
        Console.WriteLine(image);
    }
}

AngleSharp implements the W3C standard, so it works better than the HTML Agility Pack on real-world web pages.
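
Since the target URL is a directory listing, the files may be exposed as plain <a href> links rather than <img> elements; in that case a similar query over the anchors should work. A minimal sketch, reusing the document variable from the snippet above (the extension filter is an assumption):

// Sketch only: select anchor hrefs and keep those that look like images
var links = document.QuerySelectorAll("a")
    .Select(a => a.GetAttribute("href"))
    .Where(href => href != null && (href.EndsWith(".jpg") || href.EndsWith(".png")));

foreach (var link in links)
{
    Console.WriteLine(link);
}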

Upvotes: 0

ahmeticat

Reputation: 1939

You can use the HTML Agility Pack.

For example:

// Requires the HtmlAgilityPack NuGet package (using HtmlAgilityPack;)
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html); // html is the string downloaded in the question

// SelectNodes returns null when no <a> elements are found
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//a");
if (htmlNodes != null)
{
    foreach (var node in htmlNodes)
    {
        // GetAttributeValue avoids a null reference when an <a> has no href attribute
        Console.WriteLine(node.GetAttributeValue("href", string.Empty));
    }
}
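
To keep only the image links and turn them into absolute URLs ready for downloading, a minimal sketch along the same lines (the base URL comes from the question; the extension filter and variable names are assumptions):

// Sketch only: requires using System.Linq; in addition to HtmlAgilityPack
var baseUri = new Uri("https://www.paz.cl/");
var imageUrls = htmlDoc.DocumentNode.SelectNodes("//a")
    ?.Select(node => node.GetAttributeValue("href", string.Empty))
    .Where(href => href.EndsWith(".jpg") || href.EndsWith(".png"))
    .Select(href => new Uri(baseUri, href).ToString())
    ?? Enumerable.Empty<string>();

foreach (var url in imageUrls)
{
    Console.WriteLine(url);
}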

Upvotes: 0
