user6269864

Regex or another way to extract full URLs + relative URLs from HTML

I have a SharePoint site where users submit new items to a SharePoint list. Some fields in a list item contain URLs that reference files or images, e.g. "http://host/abc.jpg" or "/abc.jpg".

In another field, users edit HTML, which may contain any tags, such as <a href="/abc.jpg">, <img src="/abc.jpg"> and so on.

My goal is to find fields that contain links/URLs and extract those URLs that point to something with a filename plus extension. I have no problem doing this for the SharePoint fields that contain either some irrelevant information or a URL (and the URL only), using these two regexes:

// Matches a full URL, e.g. http://localhost/path/a.jpg
var fullUrlRegex = new Regex(@"^https?:\/\/(?:.*)[\.]+(?:[a-z0-9]{1,4})$");
// Matches an absolute path like //test/files to upload/222.jpg
var absolutePathRegex = new Regex(@"^\/.*[\.]+(?:[a-z0-9]{1,4})$");

var fullUrlRegexMatch = fullUrlRegex.Match(value);
var absolutePathRegexMatch = absolutePathRegex.Match(value);

// Check which one matched and save the value (null if neither did)
string fileUrl = fullUrlRegexMatch.Success ? fullUrlRegexMatch.Value
    : absolutePathRegexMatch.Success ? absolutePathRegexMatch.Value
    : null;

However, I am not sure how to approach extracting URLs (both relative and full URLs) from HTML code that users enter in the other field.

Suppose this is the user's input, and I need to extract both links to files from that HTML code.

<p>This is a <a href="/abc.jpg">picture</a>! 
And this is a pic too: <img src="/abc.jpg"></p>

The tags can really be anything, not just <a> and <img>. One approach I considered is HTML Agility Pack, but that seems like overkill. Would it be sufficient to regex-search for src="(match this)" and href="(match this)"? Is there anything I might miss?

Upvotes: 1

Views: 946

Answers (2)

rholek

Reputation: 340

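Your regexes should not contain ^ at the start and $ at the end. These are anchors: they force the pattern to match the entire input string, so a URL embedded in a larger piece of HTML will never match. See: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx.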

Also use the Matches method to get all matches; Match returns only the first one.
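
As a rough sketch of both points, the example below drops the anchors and runs Matches over the HTML. The pattern and the names UrlFinder and FindFileUrls are assumptions made for illustration, not the asker's code. It only looks at quoted values that start with http://, https:// or /, and end in a dot plus 1 to 4 alphanumeric characters, so like the original regexes it will also accept something like "http://example.com".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlFinder
{
    // Hypothetical pattern, loosely based on the two regexes in the question.
    // There is no ^ or $; instead, a lookbehind and a lookahead require the
    // value to sit between quote characters.
    private static readonly Regex FileUrlRegex = new Regex(
        @"(?<=[""'])(?:https?://|/)[^""'<>]*?\.[a-z0-9]{1,4}(?=[""'])",
        RegexOptions.IgnoreCase);

    // Returns every quoted URL or absolute path in the HTML that ends in an extension.
    public static List<string> FindFileUrls(string html)
    {
        var urls = new List<string>();
        // Matches returns all non-overlapping matches, not just the first.
        foreach (Match m in FileUrlRegex.Matches(html))
        {
            urls.Add(m.Value);
        }
        return urls;
    }
}
```

Calling UrlFinder.FindFileUrls on the sample HTML from the question should return "/abc.jpg" twice, once for the href and once for the src.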

Upvotes: 1

tcwicks

Reputation: 505

Try this regex:

(?<=(href="|src="))[/]*(?:[A-Za-z0-9-._~!$&'()*+,;=:@]|%[0-9a-fA-F]{2})*(?:/(?:[A-Za-z0-9-._~!$&'()*+,;=:@]|%[0-9a-fA-F]{2})*)*

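To cover other attributes that can hold a URL, add them to the alternation in (href="|src="). Note that the pattern only matches double-quoted attribute values.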
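
A minimal sketch of how this regex might be used from C#, assuming you want only values that end in a filename plus extension. The class name AttributeUrlExtractor, the Extract method, and the extension filter (borrowed from the question's regexes) are illustrative assumptions, not part of the regex above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttributeUrlExtractor
{
    // The regex from this answer as a C# verbatim string (each " is doubled).
    private static readonly Regex AttributeUrlRegex = new Regex(
        @"(?<=(href=""|src=""))[/]*(?:[A-Za-z0-9-._~!$&'()*+,;=:@]|%[0-9a-fA-F]{2})*(?:/(?:[A-Za-z0-9-._~!$&'()*+,;=:@]|%[0-9a-fA-F]{2})*)*");

    // Assumed filter: the value must end in a dot plus 1 to 4 alphanumerics.
    private static readonly Regex HasExtensionRegex = new Regex(@"\.[A-Za-z0-9]{1,4}$");

    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        foreach (Match m in AttributeUrlRegex.Matches(html))
        {
            // Skip empty values such as href="" and anything without an extension.
            if (m.Length > 0 && HasExtensionRegex.IsMatch(m.Value))
            {
                result.Add(m.Value);
            }
        }
        return result;
    }
}
```

Because the character classes do not include ? or #, a value such as /abc.jpg?v=2 is cut off at the question mark, which happens to suit the goal of collecting file paths.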

Upvotes: 1
