pila

Reputation: 929

Regular expression to match href, but no media files

I am using this regex to match the contents of every href on a page:

(?:href)=[\"|']?(.*?)[\"|'|>]+

It works fine, but I want to match only links that do not point to media files (png, jpg, avi, wav, gif, etc.).

I tried something like adding

((?!png).)

to my regex, but this did not work. I read this question but could not get any working solution.

Upvotes: 1

Views: 603

Answers (4)

Benjamin Gruenbaum

Reputation: 276406

I know this question has already been answered, but I'd like to offer a different approach using CsQuery instead of HtmlAgilityPack.

I find the syntax more compact, and it should look familiar because the result can be queried with LINQ:

//input is your input HTML string
var links = CQ.Create(input).Find("a").Select(x=>x.Cq().Attr("href"));

For example

var links = CQ.Create("<div><a href='blah'></a><a href='blah2'></a></div>").Find("a").Select(x=>x.Cq().Attr("href"));
Console.Write(string.Join(",", links)); //prints blah,blah2

Hope this helps anyone :)

Upvotes: 3

a1204773

Reputation: 7043

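using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
List<string> href = new List<string>();

private void addHREF()
{
    // Put the HTML you want to check here.
    string input = "";

    doc.LoadHtml(input);
    // Extensions of the files to ignore.
    string[] stringArray = { ".png", ".jpg" };
    // SelectNodes returns null when nothing matches, so guard against it.
    // "//a[@href]" skips anchors that have no href attribute.
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors == null)
        return;
    foreach (var item in anchors)
    {
        string value = item.Attributes["href"].Value;
        // Keep the link only if it contains none of the ignored extensions.
        if (!stringArray.Any(ext => value.Contains(ext)))
            href.Add(value);
    }
}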

I tested this with my input and it works well. If you have any problems, let me know.
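One caveat with a plain Contains check is that it also rejects links that merely contain ".png" somewhere in the path, and it is case-sensitive. As a rough sketch only (the class name MediaLinkFilter, the method IsMediaLink, and the extension list are hypothetical, not part of the answer above), the extension could be compared after stripping any query string or fragment:

```csharp
using System;
using System.Collections.Generic;

public static class MediaLinkFilter
{
    // Hypothetical list of extensions to treat as media; adjust as needed.
    private static readonly HashSet<string> MediaExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { ".png", ".jpg", ".avi", ".wav", ".gif" };

    // Returns true when the href's path ends in one of the media extensions.
    public static bool IsMediaLink(string href)
    {
        if (string.IsNullOrEmpty(href))
            return false;

        // Drop the query string and fragment so "pic.png?size=2" still counts.
        int cut = href.IndexOfAny(new[] { '?', '#' });
        string path = cut >= 0 ? href.Substring(0, cut) : href;

        // The extension is whatever follows the last dot in the last path segment.
        int slash = path.LastIndexOf('/');
        int dot = path.LastIndexOf('.');
        if (dot <= slash)
            return false;

        return MediaExtensions.Contains(path.Substring(dot));
    }
}
```

In the loop above, the condition could then become `if (!MediaLinkFilter.IsMediaLink(value))`, assuming that stricter behavior is what you want.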

Upvotes: 2

MikeM

Reputation: 13641

My effort

@"(?<=\shref\s*=\s*[""']?)(?![""']|\S+\.(?:png|jpg|avi|wav|gif)[""']?[\s>])\S+?(?=[""']?[\s>])";

It uses a positive look-behind to locate the start of the href value, and a negative lookahead to make sure the value does not contain a dot followed by one of png, jpg, avi, wav or gif, then an optional quote mark and a space or >. It then matches up to an optional quote mark followed by a space or >. The value does not have to be quoted, but it must not contain whitespace.
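To see how the pattern above behaves, here is a minimal sketch that runs it over a small made-up HTML string. The class name HrefRegexDemo, the method ExtractLinks, and the sample input are assumptions for illustration only; the pattern itself is copied unchanged from above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefRegexDemo
{
    // Pattern copied verbatim from the answer above.
    private const string Pattern =
        @"(?<=\shref\s*=\s*[""']?)(?![""']|\S+\.(?:png|jpg|avi|wav|gif)[""']?[\s>])\S+?(?=[""']?[\s>])";

    // Collects every href value the pattern accepts, in document order.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match match in Regex.Matches(html, Pattern))
            links.Add(match.Value);
        return links;
    }
}
```

For the hypothetical input `<a href="page.html">doc</a>`, `<a href='pic.png'>image</a>` and `<a href=notes.txt title=x>notes</a>` (one per line), ExtractLinks should return page.html and notes.txt and skip pic.png. No RegexOptions are passed here, so an upper-case extension such as .PNG would not be excluded unless RegexOptions.IgnoreCase is added.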

Upvotes: 1

JDB

Reputation: 25855

Even though I recommend against this approach, you may find this regex helpful:

(?<=href\s*=\s*['"]?)(?>(https?://)?([\da-z\.-]+)\.([a-z\.]{2,6})([/\w\.-]*)*/?)(?<!png|gif|etc)

(Based on URL regex from 8 Regular Expressions You Should Know)

Note that this expression does not allow spaces in the URL. This is because an unquoted href value would otherwise run on into the following attribute (for example, "domain.com/resource.txt title").

EXAMPLE:

static void Main( string[] args )
{

    string l_input =
        "<a href=\n" +
        "        \"HTTPS://example.com/page.html\" title=\"match\" />\n" +
        "<a href='http://site.com/pic.png' title='do not match'> <a href=domain.com/resource.txt title=match>\n" +
        " <script src=scripts.com/script.js>";

    foreach ( Match l_match in Regex.Matches( l_input, @"(?<=href\s*=\s*['""]?)(?>(https?://)?([\da-z\.-]+)\.([a-z\.]{2,6})([/\w\.-]*)*/?)(?<!png|gif|etc)", RegexOptions.IgnoreCase ) )
        Console.WriteLine( "'" + l_match.Value + "'" );

    /* 
     * Prints:
     * 
     * 'HTTPS://example.com/page.html'
     * 'domain.com/resource.txt'
     */

    Console.ReadKey( true );

}

Upvotes: 1
