Mike Perrenoud

Reputation: 67898

Finding Link Text with Regular Expressions

Team:

I need some help with some regular expressions. The goal is to be able to identify three different ways that users might express links in a note, and those are as follows.

<a href="http://www.msn.com">MSN</a>

possibilities

    http://www.msn.com     OR
    https://www.msn.com    OR
    www.msn.com

Once I can find them, I can change each one into a real A tag as necessary. I realize the first example is already an A tag, but I need to add some attributes to it that are specific to our application, such as TARGET and ONCLICK.

Now, I have regular expressions that can find each one of those individually, and those are as follows, respective to the examples above.

<a?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*)/?>
(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?

But the problem is that I can't run all of them on the string because the second one will match a part of the first one and the third one will match a part of both the first and second. At any rate -- I need to be able to find the three permutations distinctly so I can replace each one of them individually -- because the third expression for example will need http:// added to it.

I look forward to everybody's assistance!

Upvotes: 0

Views: 264

Answers (2)

jCoder

Reputation: 2319

Assuming that the link starts or ends either with a space or at the beginning/end of a line (or sits inside an existing A tag), I came up with the following code, which also includes some sample texts:

string regexPattern = "((?:<a (?:.*?)href=\")|^|\\s)((?:http[s]?://)?(?:\\S+)(?:\\.(?:\\S+?))+?)((?:\"(?:.*?)>(.*?)</a>)|\\s|$)";
string[] examples = new string[] {
    "some text <a href=\"http://www.msn.com/path/file?page=some.page&subpage=9#jump\">MSN</a>  more text",
    "some text http://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
    "some text http://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
    "some text https://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
    "some text www.msn.com/path/file?page=some.page&subpage=9#jump",
    "www.msn.com/path/file?page=some.page&subpage=9#jump more text"
};

Regex re = new Regex(regexPattern);
foreach (string s in examples) {
    MatchCollection mc = re.Matches(s);
    foreach (Match m in mc) {
        string prePart = m.Groups[1].Value;
        string actualLink = m.Groups[2].Value;
        string postPart = m.Groups[3].Value;
        string linkText = m.Groups[4].Value;
        MessageBox.Show(" prePart: '" + prePart + "'\n actualLink: '" + actualLink + "'\n postPart: '" + postPart + "'\n linkText: '" + linkText + "'");
    }
}

As this code uses numbered groups rather than named ones, it should be possible to use the regular expression in JavaScript too.

Depending on what you need to do with the existing A tag you need to parse the particular first group as well.
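To illustrate what handling the first group could look like, here is a minimal sketch. The `LinkParts.Rebuild` helper is hypothetical (it is not part of the code above), and `target="_blank"` is only a stand-in for whatever attributes your application needs. It takes the four group values from the loop and decides between "existing A tag" and "bare link" by looking at `prePart`:

```csharp
using System;

public static class LinkParts
{
    // Hypothetical helper: builds replacement text from the four group values.
    public static string Rebuild(string prePart, string actualLink, string postPart, string linkText)
    {
        // Bare "www..." links have no scheme, so prepend one for the href.
        bool hasScheme =
            actualLink.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
            actualLink.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
        string href = hasScheme ? actualLink : "http://" + actualLink;

        if (prePart.StartsWith("<a", StringComparison.OrdinalIgnoreCase))
        {
            // Existing A tag: keep its link text and rebuild the tag.
            // Simplification: any other attributes of the original tag are dropped.
            return "<a target=\"_blank\" href=\"" + href + "\">" + linkText + "</a>";
        }

        // Bare link: groups 1 and 3 hold the surrounding whitespace (or are empty),
        // so they are put back around the new tag.
        return prePart + "<a target=\"_blank\" href=\"" + href + "\">" + actualLink + "</a>" + postPart;
    }
}
```

It could be plugged into a replacement with something like `re.Replace(s, m => LinkParts.Rebuild(m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value, m.Groups[4].Value))`. If you need to preserve the original tag's other attributes, you would have to parse `prePart` and `postPart` further instead of rebuilding the tag from scratch.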

Update: Modified the regex as requested so that the link Text becomes group no. 4

Update 2: To better catch malformed links you might try this modified version:

pattern = "((?:<a (?:.*?)href=\"?)|^|\\s)((?:http[s]?://)?(?:\\S+)(?:\\.(?:[^>\"\\s]+))+)((?:\"?(?:.*?)>(.*?)</a>)|\\s|$)";

Upvotes: 1

Bruno Silva

Reputation: 3097

Well, if we want to do it in a single pass, you could create named groups for each scenario:

(?<full><a?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*)/?>.*</a>)|
(?<url>(http|https)://[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)|
(?<www>[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)

Then you would have to check which was the matched group:

Match match = regex.Match(input); // input is the note text to search, not the pattern

if (match.Success)
{
    if (match.Groups["full"].Success) 
       Console.WriteLine(match.Groups["full"].Value);
    else if (match.Groups["url"].Success)
    ....
}
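The same group check can also drive the replacement itself by passing a `MatchEvaluator` to `Regex.Replace`. The sketch below is only an illustration under a few assumptions: the three sub-patterns are deliberately simplified stand-ins (not the expressions from the question), the `LinkRewriter` class name is made up, and `target="_blank"` stands in for the application-specific attributes. Because the engine takes the leftmost match and tries the alternatives in order, a URL inside an existing A tag is consumed by the `full` group and is never matched again by `url` or `www`.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class LinkRewriter
{
    // Order matters: existing A tags first, then URLs with a scheme, then bare www hosts.
    // Simplified patterns, assumed for illustration only.
    private static readonly Regex LinkPattern = new Regex(
        @"(?<full><a\b[^>]*>.*?</a>)" +
        @"|(?<url>https?://[^\s<>""]+)" +
        @"|(?<www>\bwww\.[^\s<>""]+)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Rewrite(string note)
    {
        return LinkPattern.Replace(note, m =>
        {
            if (m.Groups["full"].Success)
            {
                // Existing tag: inject the extra attribute right after "<a".
                return Regex.Replace(m.Value, @"^<a\b", "<a target=\"_blank\"",
                                     RegexOptions.IgnoreCase);
            }

            // Bare links: the www form needs a scheme added to its href.
            string prefix = m.Groups["url"].Success ? "" : "http://";
            string href = WebUtility.HtmlEncode(prefix + m.Value);
            string text = WebUtility.HtmlEncode(m.Value);
            return "<a target=\"_blank\" href=\"" + href + "\">" + text + "</a>";
        });
    }
}
```

For example, `LinkRewriter.Rewrite("See <a href=\"http://www.msn.com\">MSN</a> or www.msn.com")` should return `See <a target="_blank" href="http://www.msn.com">MSN</a> or <a target="_blank" href="http://www.msn.com">www.msn.com</a>`. Note that these simplified patterns would swallow trailing punctuation such as a comma directly after a bare URL, and the `full` branch does not check whether the tag already has a `target` attribute.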

Upvotes: 0
