BanksySan

Reputation: 28510

Regex to ignore trailing dot if there is one

I have the following regex (crudely matching things that look like URLs):

(https?://\S*)

However, this is meant to pull URLs out of sentences, so a trailing dot is probably the end of a sentence rather than a legitimate part of the URL.

What's the magic incantation to get the capture group to ignore trailing full stops, commas, colons, semicolons, etc.?

(I know that matching URLs is a nightmare; this only needs to match them loosely, hence the very simple regex.)

Here's my test string:

lorem http://www.example.com lorem https://example.com lorem 
http://www.example.com.
lorem https://example.com.

This should match all the example.com instances.

(I'm testing it with Expresso and .NET)

Test result with trailing dot and new line:

  Expected string length 62 but was 64. Strings differ at index 31.
  Expected: "<a href="http://www.example.com">http://www.example.com</a>.\n\r"
  But was:  "<a href="http://www.example.com.\n">http://www.example.com.\n</a>\r"
  ------------------------------------------^

Example Code

public class HyperlinkParser
{
    private readonly Regex _regex =
        new Regex(
            @"(https?://\S*[^\.])");

    public string Parse(string original)
    {
        var parsed = _regex.Replace(original, "<a href=\"$1\">$1</a>");
        return parsed;
    }
}

Example Tests

[TestFixture]
public class HyperlinkParserTests
{
    private readonly HyperlinkParser _parser = new HyperlinkParser();
    private const string NO_HYPERLINKS = "dummy-text";
    private const string FULL_URL = "http://www.example.com";
    private const string FULL_URL_PARSED = "<a href=\"" + FULL_URL + "\">" + FULL_URL + "</a>";
    private const string FULL_URL_TRAILING_DOT = FULL_URL + ".";
    private const string FULL_URL_TRAILING_DOT_PARSED = "<a href=\"" + FULL_URL + "\">" + FULL_URL + "</a>.";
    private const string TRAILING_DOT_AND_NEW_LINE = FULL_URL_TRAILING_DOT + "\n\r";
    private const string TRAILING_DOT_AND_NEW_LINE_PARSED = FULL_URL_TRAILING_DOT_PARSED + "\n\r";

    private const string COMPLEX_TEXT = "Leading stuff http://www.example.com.  Other stuff.";
    private const string COMPLEX_TEXT_PARSED = "Leading stuff <a href=\"http://www.example.com\">http://www.example.com</a>.  Other stuff.";

    [TestCase(NO_HYPERLINKS, NO_HYPERLINKS)]
    [TestCase(FULL_URL, FULL_URL_PARSED)]
    [TestCase(FULL_URL_TRAILING_DOT, FULL_URL_TRAILING_DOT_PARSED)]
    [TestCase(TRAILING_DOT_AND_NEW_LINE, TRAILING_DOT_AND_NEW_LINE_PARSED)]
    [TestCase(COMPLEX_TEXT, COMPLEX_TEXT_PARSED)]
    public void Parsing(string original, string expected)
    {
        var actual = _parser.Parse(original);

        Assert.That(actual, Is.EqualTo(expected));
    }
}

Upvotes: 2

Views: 1264

Answers (1)

Zsolt Botykai

Reputation: 51613

Try this; it forbids a dot as the last character of the match:

(https?://\S*[^.])

For example, under Cygwin with egrep:

$ cat ~/tmp.txt
lorem http://www.example.com lorem https://example.com lorem
http://www.example.com.
lorem https://example.com.
$ cat ~/tmp.txt | egrep -o 'https?://\S*[^.]'
http://www.example.com
https://example.com
http://www.example.com
https://example.com

(The -o option tells egrep to print only the matched part of each line.)
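
One caveat with `[^.]`: it excludes only the dot, so it still matches a space or, in .NET where the whole string is scanned at once, a newline. That appears to be why the test in the question reports `http://www.example.com.\n` as the match. Below is a minimal sketch of a tighter variant, assuming GNU or BSD grep; the `extract_urls` helper name and the sample text are illustrative and not part of the original answer. It spells `\S` as the POSIX class `[^[:space:]]` and requires the final character to be neither whitespace nor one of `. , : ;`.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): print each URL-like
# match read from stdin, one per line, without trailing whitespace or
# sentence punctuation.
extract_urls() {
    # [^[:space:]]*      greedy run of non-whitespace (POSIX spelling of \S*)
    # [^[:space:].,:;]   final character: not whitespace and not . , : ;
    grep -oE 'https?://[^[:space:]]*[^[:space:].,:;]'
}

sample='lorem http://www.example.com lorem https://example.com lorem
http://www.example.com.
lorem https://example.com.'

# Original pattern. The appended "|" marks where each match ends and
# exposes the trailing space that [^.] swallows in the middle of a line:
#   http://www.example.com |
#   https://example.com |
#   http://www.example.com|
#   https://example.com|
printf '%s\n' "$sample" | grep -oE 'https?://[^[:space:]]*[^.]' | sed 's/$/|/'

# Tightened pattern. Every match now ends on the last URL character:
#   http://www.example.com|
#   https://example.com|
#   http://www.example.com|
#   https://example.com|
printf '%s\n' "$sample" | extract_urls | sed 's/$/|/'
```

.NET has no POSIX bracket classes, so there the same idea would be written `https?://\S*[^\s.,:;]`. Either form needs at least one character after `://`, and a URL that legitimately ends in one of the excluded characters loses that character.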

Upvotes: 3
