Finding Link Text with Regular Expressions

Question

Team:

I need some help with some regular expressions. The goal is to be able to identify three different ways that users might express links in a note, and those are as follows.

<a href="...">MSN</a>

possibilities

    http://www.msn.com     OR
    https://www.msn.com    OR
    www.msn.com

Once I can find them, I can convert each one into a real A tag as necessary. I realize the first example is already an A tag, but I need to add some attributes to it that are specific to our application, such as TARGET and ONCLICK.

I already have regular expressions that find each of those individually. They are listed below in the same order as the examples above.

\s]+))?)+\s*)/?>
(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?

The problem is that I can't run all of them on the same string, because the second one will match part of what the first one matches, and the third one will match part of what both the first and second match. I need to find the three forms distinctly so I can replace each one individually; the third form, for example, will need http:// added to it.
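To make the overlap concrete, here is a minimal C# sketch that runs the second and third patterns above against the same text and checks whether the third pattern's match sits inside the span the second one already matched. The class and method names (OverlapDemo, BareMatchIsInsideSchemeMatch) are made up for illustration; only the two patterns come from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OverlapDemo
{
    // The question's second and third patterns, copied unchanged.
    public const string UrlWithScheme =
        @"(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?";
    public const string BareHost =
        @"[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?";

    // Returns true when the first bare-host match lies entirely inside
    // the first match of the pattern that requires http:// or https://.
    public static bool BareMatchIsInsideSchemeMatch(string input)
    {
        Match full = Regex.Match(input, UrlWithScheme);
        Match bare = Regex.Match(input, BareHost);
        return full.Success && bare.Success
            && bare.Index >= full.Index
            && bare.Index + bare.Length <= full.Index + full.Length;
    }
}
```

For the text "see http://www.msn.com now" the check returns true: the second pattern matches http://www.msn.com, and the third matches the www.msn.com portion of that same span, so running the patterns one after another would rewrite the same link twice.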

I look forward to everybody's assistance!

jCoder · Accepted Answer

Assuming that each link is delimited by a space or by the beginning/end of a line (or sits inside an existing A tag), I came up with the following code, which also includes some sample texts:

// Requires: using System.Text.RegularExpressions; using System.Windows.Forms;
string regexPattern = @"((?:(.*?))|\s|$)";
string[] examples = new string[] {
    "some text MSN  more text",
    "some text http://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
    "some text http://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
    "some text https://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
    "some text www.msn.com/path/file?page=some.page&subpage=9#jump",
    "www.msn.com/path/file?page=some.page&subpage=9#jump more text"
};

Regex re = new Regex(regexPattern);
foreach (string s in examples) {
    MatchCollection mc = re.Matches(s);
    foreach (Match m in mc) {
        string prePart = m.Groups[1].Value;
        string actualLink = m.Groups[2].Value;
        string postPart = m.Groups[3].Value;
        string linkText = m.Groups[4].Value;
        // Show the four captured parts of each match, one per line.
        MessageBox.Show(" prePart: '" + prePart + "'\n" +
            " actualLink: '" + actualLink + "'\n" +
            " postPart: '" + postPart + "'\n" +
            " linkText: '" + linkText + "'");
    }
}

Because this code uses numbered groups, it should be possible to use the regular expression in JavaScript too.

Depending on what you need to do with an existing A tag, you may need to parse the first group further as well.

Update: Modified the regex as requested so that the link text becomes group no. 4.
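As a rough illustration of how the three forms could be rewritten in a single pass, here is a minimal C# sketch built on one alternation with named groups and a MatchEvaluator. It is not the pattern from this answer: the simplified A-tag pattern, the names NoteLinker and Linkify, and the target="_blank" attribute are all assumptions chosen for the example. It does not handle nested tags or strip trailing punctuation from a URL.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class NoteLinker
{
    // Alternatives are tried left to right at each position, so an existing
    // A tag is consumed whole before the URL in its href can match on its own.
    // The A-tag alternative is a simplified, assumed pattern.
    private static readonly Regex LinkPattern = new Regex(@"
          (?<anchor><a\b[^>]*>.*?</a>)      # form 1: an existing A tag
        | (?<url>\bhttps?://[^\s<>""]+)     # form 2: URL with http or https
        | (?<bare>\bwww\.[^\s<>""]+)        # form 3: bare www host
        ",
        RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

    public static string Linkify(string note)
    {
        return LinkPattern.Replace(note, m =>
        {
            if (m.Groups["anchor"].Success)
            {
                // Keep the tag as written and insert the extra attribute after "<a".
                return m.Value.Substring(0, 2) + " target=\"_blank\"" + m.Value.Substring(2);
            }

            // Plain-text link: encode it for HTML and add http:// to the bare form.
            bool isBare = m.Groups["bare"].Success;
            string text = WebUtility.HtmlEncode(m.Value);
            string href = isBare ? "http://" + text : text;
            return "<a href=\"" + href + "\" target=\"_blank\">" + text + "</a>";
        });
    }
}
```

With this sketch, NoteLinker.Linkify("see www.msn.com") returns see <a href="http://www.msn.com" target="_blank">www.msn.com</a>. Because each match reports which named group succeeded, every form gets its own replacement and no text is matched twice.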

Update 2: To better catch malformed links you might try this modified version:

pattern = @"((?:(.*?))|\s|$)";
