Daniel Lip

Reputation: 11335

How to get the description of a download link as well as the link itself?

Link example:

<img src="https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg" alt="This is the description i want to get too" >

This is the method I'm using to parse the links from the downloaded HTML source file:

public List<string> GetLinks(string message)
{
    List<string> list = new List<string>();
    string txt = message;
    // Match every http/ftp/https URL in the downloaded HTML source
    foreach (Match item in Regex.Matches(txt, @"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?"))
    {
        // Only thumbnail URLs are of interest
        if (item.Value.Contains("thumbs"))
        {
            int index1 = item.Value.IndexOf("mp4");

            // Turn the thumbnail URL into the matching video URL
            string news = ReplaceLastOccurrence(item.Value, "thumbs", "videos");

            if (index1 != -1)
            {
                // Keep everything up to the first "mp4" and skip duplicates
                string result = news.Substring(0, index1 + 3);
                if (!list.Contains(result))
                {
                    list.Add(result);
                }
            }
        }
    }

    return list;
}

But this gives me only the link. I also want to get the link description (the alt text), which in this example is:

This is the description i want to get too

Then I use it like this:

string[] files = Directory.GetFiles(@"D:\Videos\");
foreach (string file in files)
{
    foreach (string text in GetLinks(File.ReadAllText(file)))
    {
        if (!videosLinks.Contains(text))
        {
            videosLinks.Add(text);
        }
    }
}

And when downloading the links:

private async void btnStartDownload_Click(object sender, EventArgs e)
{
    if (videosLinks.Count > 0)
    {
        for (int i = 0; i < videosLinks.Count; i++)
        {
            string fileName = System.IO.Path.GetFileName(videosLinks[i]);
            await DownloadFile(videosLinks[i], @"D:\Videos\videos\" + fileName);
        }
    }
}

But I want fileName to be the description of each link.

Upvotes: 0

Views: 1943

Answers (3)

Lance U. Matthews

Reputation: 16612

Ibrahim's answer shows how simple this is to do with a proper HTML parser, but I suppose if you just wanted to pull out a single tag from a single page or otherwise didn't want to use an external dependency then regular expressions aren't unreasonable, especially if you can make certain assumptions about the HTML you're matching.

Note that the pattern and code below are just for demonstration purposes and not meant to be a robust, exhaustive tag parser; it's up to the reader to augment them, as needed, to handle all the kinds of HTML quirks and peculiarities they might encounter in the wild, wild web. For example, the pattern will not match image tags with attribute values surrounded by single quotes or no quotes at all, and the code throws an exception if a tag has multiple attributes with the same name.

The way I would do this is with a pattern that will match an <img /> tag and all of its attribute pairs...

<img(?:\s+(?<name>[a-z]+)="(?<value>[^"]*)")*\s*/?>

...which you can then query to find the attributes you care about. You would use that pattern to extract the image attributes into a Dictionary<string, string> like this...

static IEnumerable<Dictionary<string, string>> EnumerateImageTags(string input)
{
    const string pattern =
@"
<img                     # Start of tag
    (?:                  # Attribute name/value pair: noncapturing group
        \s+              # One or more whitespace characters
        (?<name>[a-z]+)  # Attribute name: one or more letters
        =                # Literal equals sign
        ""               # Literal double quote
        (?<value>[^""]*) # Attribute value: zero or more non-double quote characters
        ""               # Literal double quote
    )*                   # Zero or more attributes are allowed
    \s*                  # Zero or more whitespace characters
/?>                      # End of tag with optional forward slash
";

    foreach (Match match in Regex.Matches(input, pattern, RegexOptions.IgnorePatternWhitespace))
    {
        string[] attributeValues = match.Groups["value"].Captures
            .Cast<Capture>()
            .Select(capture => capture.Value)
            .ToArray();
        // Create a case-insensitive dictionary mapping from each capture of the "name" group to the same-indexed capture of the "value" group
        Dictionary<string, string> attributes = match.Groups["name"].Captures
            .Cast<Capture>()
            .Select((capture, index) => new KeyValuePair<string, string>(capture.Value, attributeValues[index]))
            .ToDictionary(pair => pair.Key, pair => pair.Value, StringComparer.OrdinalIgnoreCase);

        yield return attributes;
    }
}

Given SO74133924.html...

<html>
    <body>
        <p>This image comes from https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg:
        <img src="https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg" alt="This is the description i want to get too">
        <p>This image has additional attributes on multiple lines in a self-closing tag:
        <img
            first="abc"
            src="https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg"
            empty=""
            alt="This image has additional attributes on multiple lines in a self-closing tag"
            last="xyz"
        />
        <p>This image has empty alternate text:
        <img src="https://example.com/?message=This image has empty alternate text" alt="">
        <p>This image has no alternate text:
        <img src="https://example.com/?message=This image has no alternate text">
    </body>
</html>

...you'd consume the attribute dictionary of each tag like this...

static void Main()
{
    string input = File.ReadAllText("SO74133924.html");

    foreach (Dictionary<string, string> imageAttributes in EnumerateImageTags(input))
    {
        foreach (string attributeName in new string[] { "src", "alt" })
        {
            string displayValue = imageAttributes.TryGetValue(attributeName, out string attributeValue)
                ? $"\"{attributeValue}\"" : "(null)";
            Console.WriteLine($"{attributeName}: {displayValue}");
        }
        Console.WriteLine();
    }
}

...which outputs this...

src: "https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg"
alt: "This is the description i want to get too"

src: "https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg"
alt: "This image has additional attributes on multiple lines in a self-closing tag"

src: "https://example.com/?message=This image has empty alternate text"
alt: ""

src: "https://example.com/?message=This image has no alternate text"
alt: (null)
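
To tie this back to the question, the alt value could then be used as the download file name once any characters the file system rejects are replaced. The following is only a minimal sketch: the class DescriptionFileNames, the helper ToFileName, and the "untitled" fallback are hypothetical names introduced here for illustration, and the URL is the video link that the question's GetLinks method would produce for the sample tag.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class DescriptionFileNames
{
    // Characters Windows rejects in file names, merged with whatever
    // the current platform reports as invalid.
    private static readonly HashSet<char> InvalidChars = new HashSet<char>(
        "<>:\"/\\|?*".Concat(Path.GetInvalidFileNameChars()));

    // Hypothetical helper: builds a file name from an alt text and keeps
    // the extension of the URL being downloaded.
    public static string ToFileName(string description, string url, string fallback = "untitled")
    {
        string cleaned = new string((description ?? "")
            .Select(c => InvalidChars.Contains(c) || char.IsControl(c) ? '_' : c)
            .ToArray()).Trim();

        // An empty or missing alt text falls back to a placeholder name.
        if (cleaned.Length == 0)
        {
            cleaned = fallback;
        }

        return cleaned + Path.GetExtension(url);
    }

    public static void Main()
    {
        // prints "This is the description i want to get too.mp4"
        Console.WriteLine(ToFileName(
            "This is the description i want to get too",
            "https://thumbs.com/videos/test.mp4"));
    }
}
```

Note that two images with the same alt text would map to the same file name, so a real downloader would probably need to append a counter or fall back to the URL's own file name in that case.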

Upvotes: 1

Ibrahim Timimi

Reputation: 3750

You can use Html Agility Pack, an HTML parser written in C# that reads and writes the DOM and supports plain XPath or XSLT. The example below retrieves the description from the alt attribute along with the src attribute.

Implementation:

using HtmlAgilityPack;
using System;
                    
public class Program
{
    public static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        var html = "<img src=\"https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg\" alt=\"This is the description i want to get too\" >";
        doc.LoadHtml(html);
        HtmlNode image = doc.DocumentNode.SelectSingleNode("//img");

        Console.WriteLine("Source: {0}", image.Attributes["src"].Value);
        Console.WriteLine("Description: {0}", image.Attributes["alt"].Value);
        Console.Read();
    }
}

Demo:
https://dotnetfiddle.net/nAAZDL

Output:

Source: https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg
Description: This is the description i want to get too
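
The example above reads a single tag and would throw a NullReferenceException if the page had no image or the image had no alt attribute. For a downloaded page with many images, something along these lines might be closer to what the question needs. It is a sketch only: the class ImageDescriptions and the method GetImages are hypothetical names, and returning an empty string for a missing alt is an assumption about the desired behavior.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ImageDescriptions
{
    // Returns (src, alt) pairs for every <img> that has a src attribute.
    // The alt part is "" when the attribute is missing.
    public static List<KeyValuePair<string, string>> GetImages(string html)
    {
        var results = new List<KeyValuePair<string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode img in nodes)
        {
            // GetAttributeValue returns the supplied default when the attribute is absent.
            string src = img.GetAttributeValue("src", "");
            string alt = img.GetAttributeValue("alt", "");
            results.Add(new KeyValuePair<string, string>(src, alt));
        }

        return results;
    }
}
```

Each pair could then feed the download loop from the question, with Key as the link to rewrite and Value as the description.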

Upvotes: 1

Anirudha Gupta

Reputation: 9299

Regex-based code like yours can use more CPU cycles and run slowly. Use a library such as AngleSharp instead.

I tried to write your code in AngleSharp. This is how I did it.

string test = "<img src=\"https://thumbs.com/thumbs/test.mp4/test1.mp4-3.jpg\" alt=\"This is the description i want to get too\" >\r\n";
var configuration = Configuration.Default.WithDefaultLoader();
var context = BrowsingContext.New(configuration);
using var doc = await context.OpenAsync(req => req.Content(test));

// The alt attribute holds the description of the image
string description = doc.QuerySelector("img").Attributes["alt"].Value;
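
For HTML that is already in a string, a synchronous variant that walks every image might look like the sketch below. It assumes AngleSharp 0.10 or later (where HtmlParser.ParseDocument is available); the class AngleSharpImages and the method GetDescriptions are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;

public static class AngleSharpImages
{
    // Maps each image's src to its alt text; images without alt map to null.
    public static Dictionary<string, string> GetDescriptions(string html)
    {
        var descriptions = new Dictionary<string, string>();
        var parser = new HtmlParser();
        using var document = parser.ParseDocument(html);

        foreach (IElement img in document.QuerySelectorAll("img"))
        {
            // GetAttribute returns null when the attribute is absent.
            string src = img.GetAttribute("src");
            if (src != null && !descriptions.ContainsKey(src))
            {
                descriptions.Add(src, img.GetAttribute("alt"));
            }
        }

        return descriptions;
    }
}
```

Keeping only the first description seen for a given src is an arbitrary choice here; it mirrors the duplicate check in the question's GetLinks method.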

Upvotes: -1
