Reputation: 929
I am using this regex to match the contents of all href attributes on a page:
(?:href)=[\"|']?(.*?)[\"|'|>]+
It works fine, but I want to match only links that are not media files (png, jpg, avi, wav, gif, etc.).
I tried something like adding
((?!png).)
to my regex, but this did not work. I read this question but could not get any working solution.
Upvotes: 1
Views: 603
Reputation: 276406
I know this question has already been answered.
I'd like to offer a different approach using CsQuery instead of HtmlAgilityPack.
I think the syntax is more compact, and it is very similar to other structures since it's based on LINQ.
//input is your input HTML string
var links = CQ.Create(input).Find("a").Select(x=>x.Cq().Attr("href"));
For example
var links = CQ.Create("<div><a href='blah'></a><a href='blah2'></a></div>").Find("a").Select(x=>x.Cq().Attr("href"));
Console.Write(string.Join(",",links)); //prints blah,blah2
Hope this helps anyone :)
Upvotes: 3
Reputation: 7043
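using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
List<string> href = new List<string>();
private void addHREF()
{
    // Put the HTML you want to check here.
    string input = "";
    doc.LoadHtml(input);
    // Extensions to ignore.
    string[] ignoredExtensions = { ".png", ".jpg" };
    // SelectNodes returns null when nothing matches, so guard against that.
    // The [@href] predicate skips anchors that have no href attribute.
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors == null) return;
    foreach (var item in anchors)
    {
        string value = item.Attributes["href"].Value;
        // Keep the link only if it contains none of the ignored extensions.
        if (!ignoredExtensions.Any(value.Contains))
            href.Add(value);
    }
}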
I tested this with my own input and it works well. If you run into any problem, let me know.
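One caveat with the Contains check above is that it also rejects links where ".png" or ".jpg" appears somewhere other than the end of the path. A stricter variant compares only the end of the path. The following is a minimal sketch, not part of the original answer: the names LinkFilter, IsMediaLink and MediaExtensions are hypothetical, and the extension list is only an assumption taken from the question.

```csharp
using System;
using System.Linq;

static class LinkFilter
{
    // Hypothetical list of extensions to skip; extend as needed.
    static readonly string[] MediaExtensions = { ".png", ".jpg", ".avi", ".wav", ".gif" };

    // Returns true when the path part of the link ends in a media extension.
    public static bool IsMediaLink(string href)
    {
        // Drop the query string and fragment so "pic.png?size=2" is still detected.
        int cut = href.IndexOfAny(new[] { '?', '#' });
        string path = cut >= 0 ? href.Substring(0, cut) : href;
        return MediaExtensions.Any(
            ext => path.EndsWith(ext, StringComparison.OrdinalIgnoreCase));
    }
}

class Demo
{
    static void Main()
    {
        Console.WriteLine(LinkFilter.IsMediaLink("http://site.com/pic.PNG?size=2"));    // True
        Console.WriteLine(LinkFilter.IsMediaLink("http://site.com/my.pngs/page.html")); // False
    }
}
```

In the loop above, the condition would then become something like if (!LinkFilter.IsMediaLink(value)).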
Upvotes: 2
Reputation: 13641
My effort
@"(?<=\shref\s*=\s*[""']?)(?![""']|\S+\.(?:png|jpg|avi|wav|gif)[""']?[\s>])\S+?(?=[""']?[\s>])";
It uses a positive lookbehind to locate the start of the href value, and a negative lookahead to make sure the value does not contain a dot followed by any of png, jpg, avi, wav or gif, then an optional quote mark and a whitespace character or >. It then matches up to an optional quote mark followed by a whitespace character or >. The value does not have to be quoted, but it must not contain whitespace.
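To see the pattern in action, here is a small sketch that runs it over a made-up HTML string. The sample input and the names Program, Pattern and html are assumptions for illustration only; the pattern itself is copied unchanged from above.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // The pattern from this answer, unchanged.
    public const string Pattern =
        @"(?<=\shref\s*=\s*[""']?)(?![""']|\S+\.(?:png|jpg|avi|wav|gif)[""']?[\s>])\S+?(?=[""']?[\s>])";

    static void Main()
    {
        // Made-up sample: a double-quoted link, a single-quoted image link,
        // and an unquoted link.
        string html =
            "<a href=\"page.html\">x</a> " +
            "<a href='pic.png'>y</a> " +
            "<a href=doc.txt >z</a>";

        foreach (Match m in Regex.Matches(html, Pattern))
            Console.WriteLine(m.Value);
        // Prints:
        // page.html
        // doc.txt
    }
}
```

The match is case-sensitive as written; passing RegexOptions.IgnoreCase would also cover HREF and upper-case extensions such as .PNG.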
Upvotes: 1
Reputation: 25855
Even though I recommend against this approach, you may find this regex helpful:
(?<=href\s*=\s*['"]?)(?>(https?://)?([\da-z\.-]+)\.([a-z\.]{2,6})([/\w\.-]*)*/?)(?<!png|gif|etc)
(Based on URL regex from 8 Regular Expressions You Should Know)
Note that this expression does not allow spaces in the URL. Otherwise an unquoted href value would also swallow the attribute that follows it (for example, "domain.com/resource.txt title").
EXAMPLE:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main( string[] args )
    {
        // Sample input: quoted, single-quoted and unquoted href values, plus a script tag.
        string l_input =
            "<a href=\n" +
            " \"HTTPS://example.com/page.html\" title=\"match\" />\n" +
            "<a href='http://site.com/pic.png' title='do not match'> <a href=domain.com/resource.txt title=match>\n" +
            " <script src=scripts.com/script.js>";
        foreach ( Match l_match in Regex.Matches( l_input, @"(?<=href\s*=\s*['""]?)(?>(https?://)?([\da-z\.-]+)\.([a-z\.]{2,6})([/\w\.-]*)*/?)(?<!png|gif|etc)", RegexOptions.IgnoreCase ) )
            Console.WriteLine( "'" + l_match.Value + "'" );
        /*
         * Prints:
         *
         * 'HTTPS://example.com/page.html'
         * 'domain.com/resource.txt'
         *
         */
        Console.ReadKey( true );
    }
}
Upvotes: 1