So I have a SharePoint site, and I have users who submit new items into a SharePoint list. Some fields in the list item contain URLs that reference files or images, e.g. "http://host/abc.jpg" or "/abc.jpg".
In another field, users edit HTML code which may contain any tags, such as <a href="/abc.jpg">, <img src="/abc.jpg"> and so on.
My goal is to find fields that contain links/URLs, and extract those URLs that point to something that has a filename plus extension. I have no problem extracting this from the SharePoint fields which may contain either some irrelevant information or the URL (and the URL only) using these two regexes:
// this will match a full URL, e.g. http://localhost/path/a.jpg
var fullUrlRegex = new Regex(@"^https?:\/\/(?:.*)[\.]+(?:[a-z0-9]{1,4})$");
// this will match an absolute path like //test/files to upload/222.jpg
var absolutePathRegex = new Regex(@"^\/.*[\.]+(?:[a-z0-9]{1,4})$");
var fullUrlRegexMatch = fullUrlRegex.Match(value);
var absolutePathRegexMatch = absolutePathRegex.Match(value);
// now check which one matched and save the value
However, I am not sure how to approach extracting URLs (both relative and full URLs) from HTML code that users enter in the other field.
Suppose this is the user's input, and I need to extract both links to files from that HTML code.
<p>This is a <a href="/abc.jpg">picture</a>!
And this is a pic too: <img src="/abc.jpg"></p>
The tags can really be anything, not just <a> and <img>. One way I thought I could approach this is to use HTML Agility Pack, but this seems like overkill. Would it be sufficient to regex-search for src="(match this)" and href="(match this)"? Anything I might miss?
Upvotes: 1
Views: 946
Reputation: 340
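Your regexes should not contain ^ at the start and $ at the end. These are anchors: they force the pattern to match the entire input string, so a URL embedded in surrounding HTML will never match. See: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx.
Also use the Matches method instead of Match to get all matches, not just the first one.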
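A minimal sketch of that idea, under stated assumptions: the pattern below is not the asker's regex with the anchors simply deleted (the greedy .* would then run across the whole HTML string), but an assumed variant that stops at whitespace, quotes and angle brackets. The class name UnanchoredUrlFinder and the method FindAll are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class UnanchoredUrlFinder
{
    // Assumed pattern (not from the thread): a URL starts with http(s):// or /,
    // runs over characters that cannot end an attribute value or tag, and ends
    // in a dot plus a 1-4 character extension. There is no ^ or $ around the
    // whole pattern, so it can match anywhere inside the input.
    private static readonly Regex FileUrl = new Regex(
        @"(?:https?://|/)[^\s""'<>]*\.[a-z0-9]{1,4}(?=[""'\s<>]|$)",
        RegexOptions.IgnoreCase);

    // Matches (plural) returns every occurrence, not only the first one.
    public static string[] FindAll(string html)
    {
        return FileUrl.Matches(html)
                      .Cast<Match>()
                      .Select(m => m.Value)
                      .ToArray();
    }
}
```

Calling UnanchoredUrlFinder.FindAll on the sample HTML from the question should return both file links. Because the pattern is not tied to any attribute, it is a heuristic: plain text such as "and/or.etc" would also produce a match.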
Upvotes: 1
Reputation: 505
Try this regex:
(?<=(href="|src="))[/]*(?:[A-Za-z0-9-._~!$&'()*+,;=:@]|%[0-9a-fA-F]{2})*(?:/(?:[A-Za-z0-9-._~!$&'()*+,;=:@]|%[0-9a-fA-F]{2})*)*
Just add any other attributes that can hold a URL to the list in (href="|src=").
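For comparison, here is a sketch of a variation on the same attribute-scoped idea, not the regex above: it uses a capture group instead of a lookbehind, tolerates single quotes and whitespace around the equals sign, and then keeps only values that end in an extension, as the question's own patterns do. The names AttributeUrlExtractor and Extract are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttributeUrlExtractor
{
    // Assumed pattern: attribute name, '=', an opening quote (group 1),
    // the shortest possible value, then the same quote again (\1).
    // Add further attribute names to the alternation as needed.
    private static readonly Regex AttributeValue = new Regex(
        @"\b(?:href|src)\s*=\s*([""'])(?<url>.*?)\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Same extension rule as in the question: a dot plus 1-4 characters at the end.
    private static readonly Regex HasExtension = new Regex(
        @"\.[a-z0-9]{1,4}$", RegexOptions.IgnoreCase);

    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in AttributeValue.Matches(html))
        {
            string url = m.Groups["url"].Value.Trim();
            if (HasExtension.IsMatch(url))
            {
                urls.Add(url);
            }
        }
        return urls;
    }
}
```

As with the question's patterns, a bare host such as http://host.com also passes the extension check, and unquoted attribute values are not handled. If the input can be arbitrarily malformed, an HTML parser is the more robust choice.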
Upvotes: 1