saj

Reputation: 4796

Regex to get url from HTML

I'm using the following regex (which I found online) to obtain the URLs within an HTML page:

        Regex regex = new Regex(@"url\((?<char>['""])?(?<url>.*?)\k<char>?\)");

It works fine for the HTML below:

<div style="background:url(images/logo.png) no-repeat;">UK</div>

However, it returns more than I need when the HTML page contains the following JavaScript, where it also matches 'destpage':

function buildurl(destpage) 

I tried the following regex to require a colon before "url", but it appears to be invalid:

:url\((?<char>['""])?(?<:url>.*?)\k<char>?\)

Any help would be much appreciated.

Upvotes: 0

Views: 412

Answers (2)

user2586804

Reputation: 321

Add the colon only at the front of the pattern:

:url\((?<char>['""])?(?<url>.*?)\k<char>?\)

The second "url", inside (?<url>...), is the name of that capture group rather than text to match, so it must not contain a colon.
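To see the difference, here is a minimal sketch that runs both patterns over the two snippets from the question. The class and method names (UrlRegexDemo, Extract) are made up for illustration. Note that the colon version assumes there is no whitespace between the colon and "url".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlRegexDemo
{
    // Pattern from the question: matches any "url(", including the tail of "buildurl(".
    public const string Original = @"url\((?<char>['""])?(?<url>.*?)\k<char>?\)";

    // Pattern from this answer: a literal colon must come directly before "url(".
    public const string WithColon = @":url\((?<char>['""])?(?<url>.*?)\k<char>?\)";

    // Hypothetical helper: returns the value of the "url" group for every match.
    public static List<string> Extract(string pattern, string input)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        string page =
            "<div style=\"background:url(images/logo.png) no-repeat;\">UK</div>\n" +
            "function buildurl(destpage)";

        // The original pattern also picks up the JavaScript parameter.
        Console.WriteLine(string.Join(", ", Extract(Original, page)));

        // The colon version only matches the CSS url(...) value.
        Console.WriteLine(string.Join(", ", Extract(WithColon, page)));
    }
}
```

With the sample input, the first line should list both images/logo.png and destpage, while the second should list only images/logo.png.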

Upvotes: 0

keyboardP

Reputation: 69362

To get all the URLs, use the HtmlAgilityPack instead of a regex. From their example page:

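HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Current HtmlAgilityPack releases expose the root as DocumentNode.
// SelectNodes returns null when nothing matches, so check for that in real code.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    // Read the href attribute of each anchor.
    string href = link.GetAttributeValue("href", "");
}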

You can expand on that to obtain the URLs in your style attributes by, for example, using //@style to select the style attributes and iterating through those to extract each url value.
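A rough sketch of that idea follows, assuming the HtmlAgilityPack NuGet package is referenced. It selects elements with //*[@style] rather than //@style (my substitution, since SelectNodes returns element nodes), then applies the regex from the question only to the style text, so script content such as buildurl(destpage) is never examined. StyleUrlDemo and ExtractStyleUrls are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package, not part of the .NET base library

public static class StyleUrlDemo
{
    // Regex from the question, applied only to style attribute text.
    private static readonly Regex UrlPattern =
        new Regex(@"url\((?<char>['""])?(?<url>.*?)\k<char>?\)");

    // Hypothetical helper: collects url(...) values from every style attribute.
    public static List<string> ExtractStyleUrls(string html)
    {
        var urls = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select every element that carries a style attribute.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//*[@style]");
        if (nodes == null)
        {
            return urls; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode node in nodes)
        {
            string style = node.GetAttributeValue("style", "");
            foreach (Match m in UrlPattern.Matches(style))
            {
                urls.Add(m.Groups["url"].Value);
            }
        }
        return urls;
    }

    public static void Main()
    {
        string html =
            "<div style=\"background:url(images/logo.png) no-repeat;\">UK</div>" +
            "<script>function buildurl(destpage) {}</script>";

        foreach (string url in ExtractStyleUrls(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Inline style attributes are the only source covered here; url(...) values inside style elements or external stylesheets would need separate handling.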

Upvotes: 3
