Reputation: 67898
Team:
I need some help with some regular expressions. The goal is to be able to identify three different ways that users might express links in a note, and those are as follows.
<a href="http://www.msn.com">MSN</a>
possibilities
http://www.msn.com OR https://www.msn.com OR www.msn.com
Then by being able to find them I can change each one of them to real A tags as necessary. I realize the first example is already an A tag but I need to add some attributes to it specific to our application -- such as TARGET and ONCLICK.
Now, I have regular expressions that can find each one of those individually, and those are as follows, respective to the examples above.
<a?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*)/?>
(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
But the problem is that I can't run all of them on the string because the second one will match a part of the first one and the third one will match a part of both the first and second. At any rate -- I need to be able to find the three permutations distinctly so I can replace each one of them individually -- because the third expression for example will need http:// added to it.
I look forward to everybody's assistance!
Upvotes: 0
Views: 264
Reputation: 2319
Assuming that the link starts or ends either with a space or at the beginning/end of a line (or inside an existing A tag), I came up with the following code, which also includes some sample texts:
string regexPattern = "((?:<a (?:.*?)href=\")|^|\\s)((?:http[s]?://)?(?:\\S+)(?:\\.(?:\\S+?))+?)((?:\"(?:.*?)>(.*?)</a>)|\\s|$)";
string[] examples = new string[] {
    "some text <a href=\"http://www.msn.com/path/file?page=some.page&subpage=9#jump\">MSN</a> more text",
    "some text http://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
    "some text http://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
    "some text https://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
    "some text www.msn.com/path/file?page=some.page&subpage=9#jump",
    "www.msn.com/path/file?page=some.page&subpage=9#jump more text"
};
Regex re = new Regex(regexPattern);
foreach (string s in examples) {
    MatchCollection mc = re.Matches(s);
    foreach (Match m in mc) {
        string prePart = m.Groups[1].Value;
        string actualLink = m.Groups[2].Value;
        string postPart = m.Groups[3].Value;
        string linkText = m.Groups[4].Value;
        MessageBox.Show(" prePart: '" + prePart + "'\n actualLink: '" + actualLink + "'\n postPart: '" + postPart + "'\n linkText: '" + linkText + "'");
    }
}
As this code uses groups with numbers it should be possible to use the regular expression in JavaScript too.
Depending on what you need to do with the existing A tag, you may need to parse the first group as well.
Update: Modified the regex as requested so that the link text becomes group no. 4.
Update 2: To better catch malformed links you might try this modified version:
regexPattern = "((?:<a (?:.*?)href=\"?)|^|\\s)((?:http[s]?://)?(?:\\S+)(?:\\.(?:[^>\"\\s]+))+)((?:\"?(?:.*?)>(.*?)</a>)|\\s|$)";
Upvotes: 1
Reputation: 3097
Well, if you want to do it in a single pass, you could create a named group for each scenario:
(?<full><a?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*)/?>.*</a>)|
(?<url>(http|https)://[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)|
(?<www>[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)
Then you would have to check which was the matched group:
// "input" is the note text to search; "regex" was built from the pattern above.
Match match = regex.Match(input);
if (match.Success)
{
    if (match.Groups["full"].Success)
        Console.WriteLine(match.Groups["full"].Value);
    else if (match.Groups["url"].Success)
        ....
}
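To show how the named-group idea could drive the actual replacement, here is a minimal sketch using Regex.Replace with a MatchEvaluator. It is not the expressions from the question: the three sub-patterns are deliberately simplified, and the names LinkRewriter, Rewrite and Anchor, the href and text groups, and the target="_blank" attribute are all assumptions made for illustration. Because alternatives are tried left to right, an existing A tag is consumed whole before the bare-URL branches can see the URL inside it, which is what avoids the overlap described in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkRewriter
{
    // Simplified, hypothetical patterns (not the ones from the question).
    // Order matters: "full" is tried first, then "url", then "www".
    private static readonly Regex LinkPattern = new Regex(
        @"(?<full><a\s[^>]*?href\s*=\s*[""'](?<href>[^""']*)[""'][^>]*>(?<text>.*?)</a>)" +
        @"|(?<url>https?://[^\s<>""]+)" +
        @"|(?<www>\bwww\.[^\s<>""]+)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Rewrite(string note)
    {
        return LinkPattern.Replace(note, m =>
        {
            // Existing A tag: rebuild it from its href and link text.
            // Note that any other attributes on the original tag are dropped.
            if (m.Groups["full"].Success)
                return Anchor(m.Groups["href"].Value, m.Groups["text"].Value);

            // URL that already carries a scheme: use it as-is.
            if (m.Groups["url"].Success)
                return Anchor(m.Value, m.Value);

            // Bare www. form: the scheme is missing, so prepend one.
            return Anchor("http://" + m.Value, m.Value);
        });
    }

    // Builds the replacement tag; an onclick attribute could be added here too.
    private static string Anchor(string href, string text)
    {
        return "<a href=\"" + href + "\" target=\"_blank\">" + text + "</a>";
    }
}
```

Calling LinkRewriter.Rewrite("see www.msn.com now") should wrap the bare host in an A tag whose href starts with http://. One limitation of these simplified patterns: punctuation directly after a link (for example a sentence-ending period) is swallowed into the URL, which the stricter character classes in the question's expressions are designed to avoid.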
Upvotes: 0